Databricks notebook source exported at Tue, 28 Jun 2016 09:35:44 UTC

Scalable Data Science

prepared by Paul Brouwers, Raazesh Sainudiin and Sivanand Sivaram

**Students of the Scalable Data Science Course at UC, Ilam**

  • First, check whether a cluster named classClusterTensorFlow is running.
  • If it is, skip this notebook and attach the next notebook to classClusterTensorFlow.

TensorFlow initialization scripts

This notebook explains how to install TensorFlow on a large cluster. It is not required for the Databricks Community Edition.

The TensorFlow library needs to be installed directly on all the nodes of the cluster. We show here how to install complex Python packages that are not yet supported by the Databricks library manager. Such libraries are installed directly using cluster initialization scripts (“init scripts” for short). These scripts are Bash programs that run on a compute node when that node is added to a cluster.

For more information, refer to the init scripts section of the Databricks guide.

These scripts require the name of the cluster. If you use this notebook, you will need to change the name of the cluster in the cell below:

Step 1. Set cluster variable and check


# Change the value to the name of your cluster:
clusterName = "classClusterTensorFlow"

Check whether the init scripts already exist for this cluster:


dbutils.fs.ls("dbfs:/databricks/init/%s/" % clusterName)

If `pillow-install.sh` and `tensorflow-install.sh` are already listed for this cluster, skip Step 2 below.
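
The same check can be made programmatically. The sketch below is only an illustration: `missing_init_scripts` and `REQUIRED_SCRIPTS` are hypothetical names, not part of `dbutils` or of the original notebook. The helper takes the file names from a directory listing and reports which of the two expected scripts are absent.

```python
# Hypothetical helper (not part of dbutils): report which expected init
# scripts are absent from a directory listing.
REQUIRED_SCRIPTS = ("tensorflow-install.sh", "pillow-install.sh")

def missing_init_scripts(listed_names, required=REQUIRED_SCRIPTS):
    """Return the required script names that do not appear in listed_names."""
    present = set(listed_names)
    return [name for name in required if name not in present]

# In the notebook, the names would come from the listing cell, e.g.
#   names = [f.name for f in dbutils.fs.ls("dbfs:/databricks/init/%s/" % clusterName)]
# Here a plain list stands in for that listing.
names = ["tensorflow-install.sh"]
print(missing_init_scripts(names))
```

An empty result means both scripts are present and Step 2 can be skipped. If the directory does not exist at all, the listing call itself may fail, which also indicates that the scripts have not been created yet.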

Step 2. (Re)create the init scripts

If the .sh files above are not there, then evaluate the cell below and restart the cluster.

Sub-step 2.1

The commands in this step create init scripts that install the TensorFlow library on your cluster whenever it is started or restarted. If you do not want TensorFlow installed on this cluster by default, remove the scripts by running the following commands:

  dbutils.fs.rm("dbfs:/databricks/init/%s/tensorflow-install.sh" % clusterName)
  dbutils.fs.rm("dbfs:/databricks/init/%s/pillow-install.sh" % clusterName)

The next cell creates the init scripts. Restart your cluster after running it.


dbutils.fs.mkdirs("dbfs:/databricks/init/")
# The script text starts right after the opening quotes so that the shebang is on line 1.
# The final argument (True) overwrites any existing script of the same name.
dbutils.fs.put("dbfs:/databricks/init/%s/tensorflow-install.sh" % clusterName, """#!/bin/bash
/databricks/python/bin/pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
""", True)

# Pillow is only needed for nicer image visualization
dbutils.fs.put("dbfs:/databricks/init/%s/pillow-install.sh" % clusterName, """#!/bin/bash
echo "------ packages --------"
sudo apt-get -y --force-yes install libtiff5-dev libjpeg8-dev zlib1g-dev
echo "------ python packages --------"
/databricks/python/bin/pip install pillow
""", True)
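
When script text is passed as a triple-quoted string, a line break right after the opening quotes pushes `#!/bin/bash` off the first line. Whether that matters depends on how the script is launched, but it is easy to rule out. A minimal sketch of one way to do so is to build the text from a list of commands; `init_script_path` and `render_init_script` are hypothetical helper names, not Databricks APIs.

```python
def init_script_path(cluster_name, script_name):
    """DBFS path of a cluster-specific init script, in the layout this notebook uses."""
    return "dbfs:/databricks/init/%s/%s" % (cluster_name, script_name)

def render_init_script(commands):
    """Join shell commands into script text with the shebang on line 1."""
    return "\n".join(["#!/bin/bash"] + list(commands)) + "\n"

script = render_init_script(["/databricks/python/bin/pip install pillow"])
path = init_script_path("classClusterTensorFlow", "pillow-install.sh")

# In the notebook, the result would be written with:
#   dbutils.fs.put(path, script, True)
```

Keeping the path and the script text in small functions also avoids repeating the `dbfs:/databricks/init/...` pattern for every script.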

Sub-step 2.2

Restart your cluster now.

Step 3. Check that the scripts ran correctly after the cluster starts (or restarts)

As explained in the Databricks guide, the output of init scripts is stored in DBFS. The following cell accesses the latest content of the logs after a cluster start:


# Take the last entry in the output listing, which this notebook treats as the latest cluster start.
stamp = str(dbutils.fs.ls("/databricks/init/output/%s/" % clusterName)[-1].name)
print("Stamp is %s" % stamp)
files = dbutils.fs.ls("/databricks/init/output/%s/%s" % (clusterName, stamp))
# Keep only the logs produced by the TensorFlow install script.
tf_files = [str(fi.path) for fi in files if fi.name.startswith("%s-tensorflow-install" % clusterName)]
logs = [dbutils.fs.head(fname) for fname in tf_files]
for log in logs:
  print("************************")
  print(log)
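
The selection logic in that cell can be separated from the `dbutils` calls, which makes it easier to test and guards against an empty listing. The sketch below is illustrative only: `latest_stamp` and `select_logs` are hypothetical helpers, the stamp and file names in the example are made up, and it assumes that stamp names sort lexicographically in chronological order.

```python
def latest_stamp(stamp_names):
    """Return the last stamp in lexicographic order (assumed to be the newest)."""
    if not stamp_names:
        raise ValueError("no init-script output found for this cluster")
    return sorted(stamp_names)[-1]

def select_logs(file_names, cluster_name, script="tensorflow-install"):
    """Keep the file names that start with '<cluster_name>-<script>'."""
    prefix = "%s-%s" % (cluster_name, script)
    return [name for name in file_names if name.startswith(prefix)]

# Made-up names standing in for the output of dbutils.fs.ls(...)
stamps = ["2016-06-27_10-00-00/", "2016-06-28_09-00-00/"]
files = [
    "classClusterTensorFlow-tensorflow-install.sh_node1.log",
    "classClusterTensorFlow-pillow-install.sh_node1.log",
]
newest = latest_stamp(stamps)
tf_logs = select_logs(files, "classClusterTensorFlow")
```

In the notebook, the two lists would be built from the `name` attribute of the entries returned by `dbutils.fs.ls`.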
