Rootless Spark

Installing Spark-Hadoop-Yarn-Hive-Zeppelin without Root Access

By Dan Lilja with assistance from Tilo Wiklund

This guide will help you set up an Apache Spark cluster, both in standalone mode and together with Apache Hadoop’s HDFS and YARN, along with Apache Hive and Apache Zeppelin, all without requiring root access. It assumes a basic familiarity with Spark, OpenSSH, and Bash, all of which are used throughout. The guide assumes the following setup:

  • A computer which you use to connect to other machines on the network. This could be your own computer, a workstation, or something similar.
  • A number of networked machines which you can connect to. These will be used as master and workers for the Spark cluster.
  • The same username for the master and all the workers.

The master will be referred to separately from the workers, but the machine running the master node can also run a worker process. The guide also assumes Hadoop version 2.8.0 and Spark version 2.1.1, though it should apply to other versions with only minor changes.

Video of the Meetup

Uppsala Big Data Meetup Video of the Event

The preconfigured files for the setup from the video are included in the rootless-files folder. Note that they require modification to be usable for your particular setup.

Requirements

  • OpenSSH (or alternative) installed on each machine with execution privileges,
  • SSH login for each machine and access to each machine from your computer and the chosen master,
  • read and execute permissions on your /home/user/ folder on each machine.

Preparations

Before beginning, decide which machine to use as master and which machines to use as workers. Make sure you can SSH to each worker from the master and vice versa. The hostnames or IP addresses of each machine will be needed. Using Bash, run the command ifconfig or ip addr to find the IP address of the current machine.

Since we will be communicating with the machines using SSH, and Spark also communicates via SSH, we will set up passwordless logins from your computer to each machine and from the master to each worker.

Generate a public/private keypair on your computer

If you do not have a public/private keypair on your computer the first step will be to generate one.

Make sure that you have a directory called .ssh in your home directory by running ls -a ~. If it does not exist, run the command mkdir ~/.ssh.

To create the keypair run the command ssh-keygen from the .ssh directory in your home folder and select a filename for your private key when prompted. The corresponding public key will be created as filename.pub. If you want or require a specific type of keypair run the command ssh-keygen -t [type] where [type] is your desired keypair type, for example ed25519.
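
As a concrete sketch, if you want an ed25519 keypair, the command might look like the following (the filename id_ed25519 is only a suggestion; pick any name you like):

# generate a personal keypair; the filename passed to -f is just an example
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519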

Generate a public/private keypair for your master

Generate a new public/private keypair to be used solely for connecting the master to the workers. Since you will need to upload the private key to the master you do NOT want to use your own key to set up passwordless login from the master to the workers.

Make sure that you have a directory called .ssh in your home directory by running ls -a ~. If it does not exist, run the command mkdir ~/.ssh.

To create the keypair run the command ssh-keygen from the .ssh directory in your home folder and select a filename for your private key when prompted. The corresponding public key will be created as filename.pub. If you want or require a specific type of keypair run the command ssh-keygen -t [type] where [type] is your desired keypair type, for example ed25519.
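
For example, a dedicated ed25519 keypair for the cluster could be generated like this (the filename spark_cluster_key is only an example):

# generate a keypair used solely for master-to-worker logins
ssh-keygen -t ed25519 -f ~/.ssh/spark_cluster_key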

Setting up passwordless logins

With the necessary keypairs created we are ready to set up the passwordless SSH logins. We need passwordless logins from your computer to the master and the workers using your computer’s keypair, and passwordless logins from the master to each worker using the dedicated keypair. Make sure that each machine has a directory called .ssh in your home directory by running ls -a ~ on it. If it is missing on any machine, run the command mkdir ~/.ssh there.

First copy the public/private keypair for your master with the command scp ~/.ssh/[keyfile] ~/.ssh/[keyfile].pub [username]@[master]:.ssh where [keyfile] is the filename given to the private key, [username] is your username on the master, and [master] is the hostname or IP address of the master. Do NOT use your personal private/public keypair for this.

Next, add your computer’s public key to the authorized_keys file on the master by running the command ssh-copy-id -i ~/.ssh/[keyfile].pub [username]@[master] on your computer, where [keyfile].pub is your public key, [username] is your username on the master, and [master] is the hostname or IP address of the master. At this point you might want to test that the passwordless login is working by running ssh [username]@[master] from your computer.

From the master, run the command ssh-copy-id -i ~/.ssh/[keyfile].pub [username]@[worker], where [keyfile].pub is the public key generated for the master node, [username] is your username on a worker, and [worker] is the hostname or IP address of a worker. Do this for each worker. Test that it is working by running ssh [username]@[worker] from the master.
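
As a sketch, assuming two workers named worker1 and worker2 and the example key file spark_cluster_key from above (all placeholders), the master-to-worker setup could be scripted on the master as:

# run on the master; worker1 and worker2 are placeholder hostnames
for w in worker1 worker2; do
  ssh-copy-id -i ~/.ssh/spark_cluster_key.pub "$USER@$w"
  ssh "$USER@$w" 'echo "passwordless login to $(hostname) works"'
done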

Setting up Spark

Downloading binaries

Download pre-built Spark binaries: http://spark.apache.org/downloads.html

Download Java JRE binaries: http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html

Extract the archives to a folder of your choice. The rest of this guide will assume that they have been extracted to your home folder. If they are not in your home folder, change the paths accordingly.

Note: The Spark root folder will be referred to as [spark] and the Java JRE root folder will be referred to as [jre].

Configuring Spark

Before we can begin using Spark we will have to edit the configuration files.

Begin by copying the file ~/[spark]/conf/spark-env.sh.template using the command cp ~/[spark]/conf/spark-env.sh.template ~/[spark]/conf/spark-env.sh. This will copy its contents to the new file spark-env.sh in the ~/[spark]/conf/ folder.

Open the newly created file spark-env.sh and add the following lines:

  • export JAVA_HOME="/home/[username]/[jre]"
  • SPARK_MASTER_HOST="[master]"

where [username] is your username for the master and the workers and [master] is the hostname or IP address of the master. You can also make other changes as appropriate. All Spark configuration options are described in the comments of the file spark-env.sh.template or the file spark-env.sh you just created.
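
As a concrete sketch, assuming the username jdoe, a JRE folder named jre1.8.0_131, and a master at 192.168.1.10 (all placeholders for your own setup), spark-env.sh might contain:

export JAVA_HOME="/home/jdoe/jre1.8.0_131"
SPARK_MASTER_HOST="192.168.1.10"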

Next create the file slaves in the ~/[spark]/conf/ folder by copying the template using the command cp ~/[spark]/conf/slaves.template ~/[spark]/conf/slaves.

Open the newly created file slaves and add, for each worker, the line [worker] where [worker] is the hostname or IP address of that worker. You may also want to remove the line localhost so that a worker will not be started on your own computer.
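
For example, with two workers at 192.168.1.11 and 192.168.1.12 (placeholder addresses), the slaves file would simply contain:

192.168.1.11
192.168.1.12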

Copying to machines

With everything configured properly we need to copy all the files to the master and each worker. Do this by first running the command scp -r ~/[spark] ~/[jre] [username]@[master]: where [username] is your username and [master] is the hostname or IP address of the master. Then, for each worker, run scp -r ~/[spark] ~/[jre] [username]@[worker]: where [username] is your username and [worker] is the hostname or IP address of the worker.
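
As a sketch, assuming the username jdoe, a master named master1, workers worker1 and worker2, and the default folder names of the extracted archives (all placeholders), the copying can be scripted as:

# the trailing ':' copies into the remote home directory
for host in master1 worker1 worker2; do
  scp -r ~/spark-2.1.1-bin-hadoop2.7 ~/jre1.8.0_131 "jdoe@$host:"
done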

Note: If you’re asked for your SSH login password during this step, then passwordless SSH login is not configured properly.

Starting and testing

After everything is copied it’s time to make sure that everything works. To start everything at once, run the script start-all.sh found in the ~/[spark]/sbin/ folder. If everything goes well, open a browser and go to [master]:8080; you should see the Spark web UI with all workers connected. To launch a Spark shell against this cluster run ~/[spark]/bin/spark-shell --master spark://[master]:7077 where [master] is the hostname or IP address of the master. Once started you should be able to see the application in the Spark web UI. You can also use spark-submit with appropriate options to submit jobs to the master. For example, ~/[spark]/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://[master]:7077 --deploy-mode cluster ~/[spark]/examples/jars/spark-examples_2.11-2.1.1.jar 10 runs an example that computes an approximation of Pi.
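
Another quick sanity check, assuming the cluster is up and [master] is reachable from your shell, is to pipe a one-liner into the Spark shell and confirm it completes:

# should print a result of 500500.0 among the shell output
echo 'sc.parallelize(1 to 1000).sum' | ~/[spark]/bin/spark-shell --master spark://[master]:7077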

Setting up Hadoop

The next step is to use Spark as a part of Hadoop with YARN as the resource manager. This will, among other things, let Spark access HDFS for data to analyze and allow several Spark instances to run simultaneously. For this we will first configure HDFS and YARN.

The binaries for Hadoop can be found here: http://hadoop.apache.org/releases.html

This guide will assume that it has been extracted to the same folder as the Spark folder, i.e. your home folder. The Hadoop root folder will be referred to as [hadoop]. We will also assume that you want the same master and worker setup as before.

Configuring HDFS

We will start with HDFS. Once HDFS is configured and running we will add YARN.

As with Spark, we first need to tell Hadoop where to look for Java. To do this, open the file ~/[hadoop]/etc/hadoop/hadoop-env.sh and change the line export JAVA_HOME=${JAVA_HOME} to export JAVA_HOME="/home/[username]/[jre]".

Next, open the file ~/[hadoop]/etc/hadoop/core-site.xml. Between the opening <configuration> tag and the closing </configuration> tag, add the following:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop_tmp</value>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://[master]:9000</value>
</property>

where [master] is the hostname or IP address of the master. The folder /tmp/hadoop_tmp can be changed to anything you like as long as it is somewhere you have write privileges; note that it needs to exist on each machine you intend to use.
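
Since the directory must exist on every machine, one way to create it everywhere, assuming placeholder hostnames master1, worker1, and worker2 and the username jdoe, is:

# create the Hadoop temp directory on every machine in the cluster
for host in master1 worker1 worker2; do
  ssh "jdoe@$host" 'mkdir -p /tmp/hadoop_tmp'
done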

Next open the file ~/[hadoop]/etc/hadoop/hdfs-site.xml. Between the opening <configuration> tag and the closing </configuration> tag, add the following:

<property>
  <name>dfs.replication</name>
  <value>k</value>
</property>

<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>

<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>

where k is the number of copies of a file to store on the DFS.

Lastly we need to let Hadoop know which slaves to use. To do this create a file called slaves in [hadoop]/etc/hadoop/ and add the hostname or IP address of each slave, one per line.
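
For example, with two workers named worker1 and worker2 (placeholder hostnames), [hadoop]/etc/hadoop/slaves would contain:

worker1
worker2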

Starting HDFS

Before we can start HDFS we need to format the DFS. Do this by running [hadoop]/bin/hdfs namenode -format. After it has completed successfully we can start the DFS by running [hadoop]/sbin/start-dfs.sh from the master. To check that it is working, start a browser and go to [master]:50070 where [master] is the hostname or the IP address of the master. You should get the web interface of HDFS. From here you can check the status of datanodes, explore the filesystem, view logs, etc.
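
If you prefer the command line over the web interface, a rough check is to ask the namenode for a cluster report and confirm that all datanodes are listed as live:

# prints cluster capacity and the list of live datanodes
[hadoop]/bin/hdfs dfsadmin -report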

Configuring YARN

Next we will configure YARN so that we can use it as the resource manager for Spark, among other things.

First add the following to the file [hadoop]/etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/tmp/hadoop_tmp</value>
</property>

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>[master]</value>
</property>

<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>

<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>

Next add export JAVA_HOME=[jre] to the file [hadoop]/etc/hadoop/yarn-env.sh where [jre] is the JRE root folder.

Starting YARN

Start YARN by running [hadoop]/sbin/start-yarn.sh from the master. To check that YARN is running try to connect to the web interface at [master]:8088.
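
A command-line alternative, which should list one entry per running node manager, is:

# lists the node managers registered with the resource manager
[hadoop]/bin/yarn node -list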

Configuring Spark to run with YARN

With YARN and HDFS running we want to configure Spark to use them.

Begin by creating a directory /spark in the DFS to hold the Spark .jar files. To do this, run [hadoop]/bin/hdfs dfs -mkdir /spark. To upload the .jar files run [hadoop]/bin/hdfs dfs -put [spark]/jars /spark. Check that the files were uploaded correctly by using the HDFS web interface or by running [hadoop]/bin/hdfs dfs -ls /spark/jars.

Next add the line spark.yarn.archive hdfs:///spark/jars to the file [spark]/conf/spark-defaults.conf to point Spark towards the uploaded files.

Lastly, and optionally, run a test program with [spark]/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster [spark]/examples/jars/spark-examples_2.11-2.1.1.jar 10. Note that if you are using a different version of Spark you will have to change 2.1.1 to your version.

Downloading Hive

The Apache Hive binaries can be found at: https://hive.apache.org/downloads.html

This guide assumes version 2.1.1. Download and extract the archive to a folder of your choosing. We will assume it is extracted to the same folder that contains Spark and Hadoop. The Hive root folder will be referred to as [hive].

Configuring Hive

Make a copy of the file [hive]/conf/hive-env.sh.template and name it [hive]/conf/hive-env.sh. Open it and add the following lines:

export JAVA_HOME=[jre]
export HADOOP_HOME=[hadoop]

where as before [jre] is the JRE root folder and [hadoop] is the Hadoop root folder.

Starting Hive

Before we can use Hive we need to create a couple of folders in the DFS for Hive. To do this run the following commands:

[hadoop]/bin/hdfs dfs -mkdir /tmp
[hadoop]/bin/hdfs dfs -mkdir /user/hive
[hadoop]/bin/hdfs dfs -mkdir /user/hive/warehouse
[hadoop]/bin/hdfs dfs -chmod g+w /tmp
[hadoop]/bin/hdfs dfs -chmod g+w /user/hive
[hadoop]/bin/hdfs dfs -chmod g+w /user/hive/warehouse

Next, initialize the database by running [hive]/bin/schematool -dbType [type] -initSchema where [type] is the type of database you want to initialize, e.g. derby.
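
For example, to use the embedded Derby database (sufficient for testing on a single node), the command would be:

[hive]/bin/schematool -dbType derby -initSchema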

We are now ready to start the Hive server and connect to it. To start the server run [hive]/bin/hiveserver2. After it has started run [hive]/bin/beeline -u jdbc:hive2://[master]:10000 where [master] is the hostname or IP address of the machine running the Hive server. We will assume that the Hive server is running on the HDFS master.
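
A minimal sketch of this, assuming you want the server to keep running after you log out and using the placeholder names master1 and jdoe, is:

# start HiveServer2 in the background and keep it running after logout
nohup [hive]/bin/hiveserver2 > ~/hiveserver2.log 2>&1 &
# give it a moment to start, then connect with Beeline as user jdoe
[hive]/bin/beeline -u jdbc:hive2://master1:10000 -n jdoe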

If you get an error message about your user not being allowed to impersonate, add the following lines to the [hadoop]/etc/hadoop/core-site.xml file, where [username] is the user running the Hive server:

<property>
  <name>hadoop.proxyuser.[username].groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.[username].hosts</name>
  <value>*</value>
</property>

Downloading Zeppelin

The Apache Zeppelin binaries can be downloaded at: https://zeppelin.apache.org/download.html

We will assume version 0.7.1 for this guide. Download and extract the archive to a folder of your choosing. We will assume it is extracted to the same folder that contains Spark and Hadoop. The Zeppelin root folder will be referred to as [zeppelin].

Configuring Zeppelin

Copy the file [zeppelin]/conf/zeppelin-env.sh.template to [zeppelin]/conf/zeppelin-env.sh and add the following lines:

export JAVA_HOME=[jre]
export HADOOP_HOME=[hadoop]
export SPARK_HOME=[spark]
export HIVE_HOME=[hive]

where [jre] is the root folder of the JRE binaries, [hadoop] is the Hadoop root folder, [spark] is the Spark root folder and [hive] is the Hive root folder.

Next run the command [zeppelin]/bin/zeppelin-daemon.sh start, open a web browser and go to the address [master]:8080. In the top right there is a drop-down menu; select the option “Interpreter” and find the Spark section. Change the value of the property master to yarn-client to use the YARN resource manager. Next find the jdbc section. To use Hive, change default.driver to org.apache.hive.jdbc.HiveDriver, change default.url to jdbc:hive2://[master]:10000, and change default.user to your username. Lastly add the following dependencies:

[hive]/jdbc/hive-jdbc-2.1.1-standalone.jar
[hadoop]/share/hadoop/common/hadoop-common-2.8.0.jar

Testing Zeppelin

Test if Zeppelin is working by creating a notebook and running some Spark and Hive commands. The jobs should also show up on the YARN web interface. If you are having trouble, try adding the following lines to the file [hadoop]/etc/hadoop/yarn-site.xml and restarting YARN:

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.timeline-service.hostname</name>
  <value>[master]</value>
</property>

<property>
  <name>yarn.timeline-service.bind-host</name>
  <value>0.0.0.0</value>
</property>
