Installation of Apache Spark on Ubuntu VM Server on Azure
Create VMs
The first step is to create the VMs on Azure; I am not going into the details of that here. Once done, make a note of the user ID and password.
Installation of PuTTY
In order to log in to the box, we need to install PuTTY for Windows. I used the link below:
http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
After installation, open PuTTY and enter the public IP address of the VM. A screen will open asking for the username and password; I have not yet set up passwordless authentication on the VMs.
Download Spark into desired folder
First we create the folder into which we want the Spark binary distribution to be downloaded.
login as: plasparkadmin
plasparkadmin@13.65.203.139's password:
plasparkadmin@13.65.203.139's password:
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-72-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information as of Tue Apr 11 18:55:11 UTC 2017
System load: 0.24 Memory usage: 0% Processes: 87
Usage of /: 40.9% of 1.94GB Swap usage: 0% Users logged in: 0
Graph this data and manage this system at:
https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
0 packages can be updated.
0 updates are security updates.
Your Hardware Enablement Stack (HWE) is supported until April 2019.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
plasparkadmin@pal-dev-vm-01:~$ cd ..
plasparkadmin@pal-dev-vm-01:/home$ ls
plasparkadmin
plasparkadmin@pal-dev-vm-01:/home$ cd plasparkadmin/
plasparkadmin@pal-dev-vm-01:~$ mkdir work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ chmod 777 -R work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ cd work
plasparkadmin@pal-dev-vm-01:~/work$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
--2017-04-12 16:55:57-- http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
Resolving d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)... 54.230.5.90, 54.230.5.79, 54.230.5.12, ...
Connecting to d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)|54.230.5.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195636829 (187M) [application/x-tar]
Saving to: ‘spark-2.1.0-bin-hadoop2.7.tgz’
100%[======================================>] 195,636,829 26.2MB/s in 7.7s
2017-04-12 16:56:05 (24.1 MB/s) - ‘spark-2.1.0-bin-hadoop2.7.tgz’ saved [195636829/195636829]
Now extract the downloaded file as follows
plasparkadmin@pal-dev-vm-01:~/work$ ls
spark-2.1.0-bin-hadoop2.7.tgz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xvf spark-2.1.0-bin-hadoop2.7.tgz
Download Java
Go to the Oracle site and download JDK 8. In my case, over SSH, I ran:
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u121-b13/e9e7ea248e2c4826b92b3f075a80e441/jdk-8u121-linux-x64.tar.gz"
The above command downloads the JDK 8u121 archive (it does not install anything by itself).
Extract it with the command: tar xzf jdk-8u121-linux-x64.tar.gz
Setting Environment Paths
We need to set up JAVA_HOME, SPARK_HOME, and HADOOP_HOME as below from the command line.
plasparkadmin@pal-dev-vm-01:~$ export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export PATH=$JAVA_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $JAVA_HOME
/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export PATH=$SPARK_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $SPARK_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export HADOOP_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ echo $HADOOP_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
It is prudent to also add the above paths to ~/.profile (and/or ~/.bashrc) so they persist across logins.
Command: vi ~/.profile
After opening the file, move to the end (Shift+G) and press i; this puts the editor into insert mode.
Type the export lines in and then press ESC when done. This switches the editor from insert mode back to normal mode.
To save and exit the file, type :x
To quit without saving, type :q! (plain :q works only if nothing was changed)
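For reference, these are the lines to append to ~/.profile (and/or ~/.bashrc); the paths match the install locations used above:
export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
export HADOOP_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH
After saving, run source ~/.profile (or log out and back in) so the current shell picks up the new values.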
Install Python Tools
To make Python development interactive, we also install some Python development tools such as IPython and the Jupyter notebook. The site below has detailed instructions on how to get that done, and I was able to follow it with ease.
https://www.digitalocean.com/community/tutorials/how-to-set-up-a-jupyter-notebook-to-run-ipython-on-ubuntu-16-04
Listing out the main commands. First, update the Ubuntu package lists from the terminal with the command:
$ sudo apt-get update
Please note that on the first run of the above command I hit an issue; the error looked like this:
E: Could not get lock /var/lib/dpkg/lock - open (11 Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/) is another process using it?
I was able to resolve the error with the fix below, which I copied from Stack Overflow:
Remove your /var/lib/dpkg/lock file and force package reconfiguration. It should work after this.
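For completeness, a minimal sketch of those commands (assuming no other apt/dpkg process is actually still running):
$ sudo rm /var/lib/dpkg/lock
$ sudo dpkg --configure -a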
Once Ubuntu is updated, install pip and the Python development packages as follows:
$ sudo apt-get -y install python-pip python-dev
Once done, you can check the installed versions and locations:
$pip --version
pip 8.1.1 from /usr/lib/python2.7/dist-packages (python 2.7)
$python --version
Python 2.7.12
$whereis python2.7
python2: /usr/bin/python2.7-config /usr/bin/python2 /usr/bin/python2.7 /usr/lib/python2.7 /etc/python2.7 /usr/local/lib/python2.7 /usr/include/python2.7 /usr/share/man/man1/python2.1.gz
Once that is done, we can proceed with the installation of IPython and the Jupyter notebook.
Since we are working over SSH, the notebook UI might not be reachable directly (see the note at the end of this section), but the installation steps are below.
$ sudo apt-get -y install ipython ipython-notebook
$ sudo -H pip install jupyter
Depending on what version of pip is in the Ubuntu apt-get repository, you might get the following error when trying to install Jupyter:
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
If so, you can use pip to upgrade pip to the latest version:
$ sudo -H pip install --upgrade pip
Upgrade pip, and then try installing Jupyter again:
$ sudo -H pip install jupyter
Try it out by typing:
$ jupyter notebook
Installation done
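If you do want to use the notebook from a local browser despite working over SSH, one common approach (not part of the steps above, just a sketch) is to run the notebook without opening a browser on the server and to forward its port through SSH from a machine with an OpenSSH client (PuTTY can do the same via its Tunnels settings):
# on the VM
$ jupyter notebook --no-browser --port=8888
# on your local machine, then browse to http://localhost:8888
$ ssh -L 8888:localhost:8888 plasparkadmin@13.65.203.139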
Setting up worker node/cluster
First we install the OpenSSH server on the worker node so that it can be logged into remotely from the master node. Open a terminal on the worker node and type the following installation command:
# On Worker nodes, we install SSH Server so that we can access this node from Master node
sudo apt-get install openssh-server
After installing the SSH server on the worker node, generate an RSA key on the master so that the master can access the worker node without being asked for a password.
# On Master node, we generate a rsa key for remote access
ssh-keygen
After running the above command, just keep pressing Enter (accepting the defaults and an empty passphrase) until the key pair is generated.
kghosh@DVY1L32-ubuntu-1:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/kghosh/.ssh/id_rsa):
Created directory '/home/kghosh/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/kghosh/.ssh/id_rsa.
Your public key has been saved in /home/kghosh/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:B8GWGJ+vSRKSr6L+Y+QeBvGr5Rljie4Crr+Lztc4csM kghosh@DVY1L32-ubuntu-1
The key's randomart image is:
+---[RSA 2048]----+
| .+.. |
| ...+o |
| . o ..+ |
| o o . o |
| . . o S o |
|. o.o. o + |
|o.=X+ o |
|+=*E=. |
|XO@B+ |
+----[SHA256]-----+
The terminal lists the files where the key pair was saved, as shown in the output above.
We now need to copy the public key to each of the worker node machines; the generic command for that is below.
# To access Worker nodes via SSH without providing password (just use our rsa key), we need to copy our public key to each Worker node
ssh-copy-id -i ~/.ssh/id_rsa.pub <username_on_remote_machine>@<IP_address_of_that_remote_machine>
You can get the IP address of the remote machine by typing ifconfig in its terminal. So in our case the command is as follows:
ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.164
Output-
kghosh@DVY1L32-ubuntu-1:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.164
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/kghosh/.ssh/id_rsa.pub"
The authenticity of host '192.168.1.164 (192.168.1.164)' can't be established.
ECDSA key fingerprint is SHA256:jpsnVaUNquTNRqiuKqLGGyR3AYTp/tqIneCJf5ZWcDI.
Are you sure you want to continue connecting (yes/no)? y
Please type 'yes' or 'no': yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
kghosh@192.168.1.164's password:
Number of key(s) added: 1
Next, copy this file from the master to the slave machine using scp <source> <destination>.
To copy the key from the master to the slave while logged into the master:
scp .ssh/id_rsa.pub kghosh@192.168.1.164:/home/kaushik/.ssh/id_rsa.pub
and then on the slave machine run:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys
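At this point it is worth verifying that the key-based login works; connecting from the master should no longer prompt for a password:
$ ssh kghosh@192.168.1.164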
Spark Configuration
Go to the conf folder of Spark on the master machine, copy the file slaves.template to slaves, and specify the IP address of each slave machine in it (an example of the resulting file is shown after the terminal capture below).
plasparkadmin@pal-dev-vm-01:~/work$ cd spark-2.1.0-bin-hadoop2.7/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ls
bin data jars licenses python README.md sbin
conf examples LICENSE NOTICE R RELEASE yarn
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ cd conf/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template metrics.properties.template spark-env.sh.template
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ cp slaves.template slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template slaves
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
metrics.properties.template spark-env.sh.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ vi slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$
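For illustration, the slaves file simply lists the worker hosts, one per line; the template's default localhost entry should be replaced. With a single worker it would look something like this (substitute your worker's private IP or hostname):
# one worker host per line
<worker-ip-or-hostname>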
Worker Configurations
For each worker, copy the JDK and Spark binary folders recursively from the master using the scp command:
plasparkadmin@pal-dev-vm-01:~/work$ scp -r jdk1.8.0_121/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/
plasparkadmin@pal-dev-vm-01:~/work$ scp -r spark-2.1.0-bin-hadoop2.7/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/
Once the copy is done, we need to set up JAVA_HOME, SPARK_HOME, and HADOOP_HOME on the worker just like we did on the master.
Just as on the master, we can export the paths directly in the terminal on the worker (or add them to ~/.profile).
Python Dev Tools on the Worker
Install Python on the worker so that we can submit PySpark jobs:
$ sudo apt-get update
$ sudo apt-get -y install python-pip python-dev
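With the binaries, environment variables, and the slaves file in place, the standalone cluster can be brought up from the master node. A minimal sketch using the scripts shipped in the Spark 2.1.0 sbin folder (this relies on the passwordless SSH configured earlier):
$ cd /home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
$ ./sbin/start-all.sh
start-all.sh starts a master on this machine and one worker on every host listed in conf/slaves; the master web UI listens on port 8080, and the master URL (spark://<master-ip>:7077) is the one passed to spark-submit later in this document.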
Installing Remote Desktop to Master and Minimum GUI
Most of the steps are copied from the link in the reference. These are the steps I followed to enable remote desktop and load the Spark master UI.
The first thing to enable is the port for Remote Desktop Connection on Azure: log in to the Azure portal and, in the network settings of the master VM, allow inbound TCP port 3389.
Log in to the master node using PuTTY; first install a version of the Ubuntu desktop that supports RDP, together with Xfce.
Install xfce, use:
$ sudo apt-get install xubuntu-desktop
Then enable xfce, use:
$ echo xfce4-session >~/.xsession
For Ubuntu, to install xrdp use:
$ sudo apt-get install xrdp
Edit the config file /etc/xrdp/startwm.sh, use:
$ sudo vi /etc/xrdp/startwm.sh
Add the line xfce4-session before the line /etc/X11/Xsession.
Restart the xrdp service, use:
$ sudo service xrdp restart
Connect to your Linux VM from a Windows machine
On a Windows machine, start the remote desktop client (Remote Desktop Connection) and enter your Linux VM's DNS name, or go to the Dashboard of your VM in the Azure classic portal and click Connect. At the login window, log in with the user and password of your Linux VM, and you now have a remote desktop session on your Azure Linux VM.
Sample Python Application and PySpark Submission
SSH into the master and navigate to the Spark home directory. Once there, create a project folder; in our case it is called kaushikpla. On the terminal, type:
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ cd kaushikpla/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ vi plaPythonApp.py
Copy and paste the PySpark code below from your local computer:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
def main():
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    eventPath = "wasb://sparkjob@pladevstorage.blob.core.windows.net/input/events.log"
    eventsJson = sqlContext.read.json(eventPath)
    resultDF = eventsJson.groupBy(['EventDateTime', 'EventTypeName', 'PN']).count()
    resultDF.coalesce(1).write.format('csv').options(header='true').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmcsv')
    resultDF.coalesce(1).write.format('json').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmjson')

if __name__ == "__main__":
    main()
Once that is done, go back to the spark home directory and submit the job with the below command.
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --master spark://10.0.0.4:7077 kaushikpla/plaPythonApp.py
Install Microsoft SQL JDBC Driver
In order to connect to SQL Server, we need to install the Microsoft SQL JDBC driver. Download the latest driver from the link below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Download the file sqljdbc_6.0.8112.100_enu.tar.gz and unpack it using the following commands:
plasparkadmin@pal-dev-vm-01:~/work$ gzip -d sqljdbc_6.0.8112.100_enu.tar.gz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xf sqljdbc_6.0.8112.100_enu.tar
Once it is extracted, we add the driver to Spark's classpath and write sample code to test it.
To add the driver to the classpath, I added the jar to the spark.jars entry in $SPARK_HOME/conf/spark-defaults.conf.
If there are already entries, put a comma and then append the entry below:
/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar
So my final spark-defaults.conf looked like below
spark.jars=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/hadoop-azure-2.7.0.jar,/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/azure-storage-2.0.0.jar,/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar
scp the sqljdbc folder to the workers, add the same classpath entry on each worker, and restart the cluster.
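The copy itself can be done the same way the JDK and Spark folders were copied earlier, for example:
$ scp -r sqljdbc_6.0/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/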
Testing Connection to SQL Server
Write the sample application below, save it, and run it with the following execution command:
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --driver-class-path /home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar --master spark://10.0.0.4:7077 kaushikpla/plaRollupToSQL.py
Please note that I added the driver class path to the execution command above because Spark was not able to find the driver without it.
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
def main():
    conf = (SparkConf().setAppName("data_import"))
    sc = SparkContext(conf = conf)
    #sc = SparkContext()
    sqlContext = SQLContext(sc)
    jdbcDF = sqlContext.read.format("jdbc").option("url", "jdbc:sqlserver://svr.database.windows.net;databaseName=dbname").option("dbtable", "dbo.mytable").option("user", "username").option("password", "password").load()
    # Displays the content of the DataFrame to stdout ...first 10 rows
    jdbcDF.show(10)

if __name__ == "__main__":
    main()
Writing to Azure SQL Database
Just as you can read from Azure SQL, you can also write to Azure SQL directly from a DataFrame; sample code is below.
# edf is an existing DataFrame with ET and EventCount columns; eventDate holds the date string to stamp on each row
from pyspark.sql import functions as func
from pyspark.sql.functions import lit

viewpath = "wasb://ap1rpt@dawstopla.blob.core.windows.net/events_dt.csv"
df = edf.select('ET', 'EventCount').groupBy('ET').agg(func.sum("EventCount")).withColumnRenamed('sum(EventCount)', 'EventCount')
df.withColumn('EventDate', lit(eventDate)).coalesce(1).write.format("csv").mode("overwrite").options(header='true').save(viewpath)
df.withColumn('EventDate', lit(eventDate)).write.jdbc(url="jdbc:sqlserver://server.database.windows.net;databaseName=mydb", table="dbo.mytable", mode="append", properties={"user": "username", "password":"password"})
*For the above to work, make sure the selected DataFrame columns and the column names in the database table are the same. mode="append" means the rows are appended to the data already in the table.
Python Module for MSSQL Client
There is a Python module for an MS SQL client: http://www.pymssql.org/en/stable/intro.html. To install the module, run the commands below:
sudo apt-get install freetds-dev
sudo pip install pymssql
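A quick way to confirm the module is available is to import it from the command line; if this returns without an ImportError, the install worked:
$ python -c "import pymssql"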
Once it is installed, we can run a Spark job to write to and read from the database. I have not used it yet.