// Databricks notebook source exported at Sun, 19 Jun 2016 03:00:57 UTC

# Scalable Data Science

### prepared by Raazesh Sainudiin and Sivanand Sivaram


This is an elaboration of the Apache Spark 1.6 mllib-programming-guide on mllib-data-types.

# Overview

## Data Types - MLlib Programming Guide

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.

## Local Matrix in Scala

A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports:

• dense matrices, whose entry values are stored in a single double array in column-major order, and
• sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.

For example, the following dense matrix: $$\begin{pmatrix} 1.0 & 2.0 \\ 3.0 & 4.0 \\ 5.0 & 6.0 \end{pmatrix}$$ is stored in a one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with the matrix size (3, 2).
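
The column-major layout described above can be sketched in plain Python, without Spark. The helper names `entry` and `to_rows` are hypothetical and chosen only for illustration; the index formula `i + j * num_rows` is the usual column-major rule, not a call into MLlib.

```python
# Column-major storage of the 3 x 2 matrix from the text.
num_rows, num_cols = 3, 2
values = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]

def entry(i, j):
    """Return entry (i, j); column j starts at offset j * num_rows."""
    return values[i + j * num_rows]

def to_rows():
    """Rebuild the matrix as a list of rows for easy reading."""
    return [[entry(i, j) for j in range(num_cols)] for i in range(num_rows)]

print(to_rows())  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Reading the array in chunks of `num_rows` gives the columns, which is why the same six numbers in a different order would describe a different matrix.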

The base class of local matrices is Matrix, and we provide two implementations: DenseMatrix, and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the Matrix Scala docs and Matrices Scala docs for details on the API.


Int.MaxValue // note the largest value an index can take



import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))



Next, let us create the following sparse local matrix: $$\begin{pmatrix} 9.0 & 0.0 \\ 0.0 & 8.0 \\ 0.0 & 6.0 \end{pmatrix}$$


// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
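
To see how the three arrays passed to `Matrices.sparse` encode the matrix, the following minimal sketch expands a CSC description into dense rows in plain Python. The function name `csc_to_rows` is hypothetical and not part of MLlib; it assumes the usual CSC convention that the entries of column `j` sit at positions `col_ptrs[j]` up to, but not including, `col_ptrs[j + 1]`.

```python
def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC description into a dense list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries of column j occupy positions col_ptrs[j] .. col_ptrs[j + 1] - 1.
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            dense[row_indices[k]][j] = float(values[k])
    return dense

rows = csc_to_rows(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(rows)  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

Here the column pointers `[0, 1, 3]` say that the first column holds one stored entry and the second column holds two.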



## Local Matrix in Python

The base class of local matrices is Matrix, and we provide two implementations: DenseMatrix, and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the Matrix Python docs and Matrices Python docs for more details on the API.


%py
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0));
# the values are listed column by column (column-major order)
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])
dm2



%py
# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
sm
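
Going the other way, from dense rows to CSC arrays, can be sketched as follows in plain Python. The function name `rows_to_csc` is hypothetical and not an MLlib API; it simply walks the matrix column by column and records the non-zero entries.

```python
def rows_to_csc(rows):
    """Return (col_ptrs, row_indices, values) for a dense list of rows."""
    num_rows, num_cols = len(rows), len(rows[0])
    col_ptrs, row_indices, values = [0], [], []
    for j in range(num_cols):
        for i in range(num_rows):
            if rows[i][j] != 0:
                row_indices.append(i)
                values.append(rows[i][j])
        # After finishing column j, record where the next column starts.
        col_ptrs.append(len(values))
    return col_ptrs, row_indices, values

csc = rows_to_csc([[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]])
print(csc)  # ([0, 1, 3], [0, 1, 2], [9.0, 8.0, 6.0])
```

This sketch emits row indices in ascending order within each column. The example in the text lists the two entries of the second column in the opposite order (`[0, 2, 1]` with values `[9, 6, 8]`), which describes the same matrix.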


