// Databricks notebook source exported at Sun, 19 Jun 2016 03:00:57 UTC

# Scalable Data Science

### prepared by Raazesh Sainudiin and Sivanand Sivaram

This is an elaboration of the Apache Spark 1.6 mllib-programming-guide on mllib-data-types.

# Overview

## Data Types - MLlib Programming Guide

- Local vector
- Labeled point
- Local matrix
- Distributed matrix
  - RowMatrix
  - IndexedRowMatrix
  - CoordinateMatrix
  - BlockMatrix

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.

## Local Matrix in Scala

A local matrix has integer-typed row and column indices and double-typed
values, **stored on a single machine**. MLlib supports:

- dense matrices, whose entry values are stored in a single double array in column-major order, and
- sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.

For example, the following dense matrix:
\(\begin{pmatrix} 1.0 & 2.0 \\ 3.0 & 4.0 \\ 5.0 & 6.0 \end{pmatrix}\)
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
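The column-major layout described above can be checked with a small sketch in plain Python (no Spark required). The helper name `entry` and the 0-based indexing are illustrative assumptions, not part of the MLlib API: entry `(i, j)` of a matrix with `num_rows` rows sits at position `i + j * num_rows` of the flat array.

```python
# Minimal sketch of column-major addressing (plain Python, not the MLlib API).
num_rows, num_cols = 3, 2
values = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]  # flat array from the example above


def entry(i, j):
    """Return entry (i, j) using 0-based row index i and column index j."""
    return values[i + j * num_rows]


# Rebuild the matrix row by row to compare it with the matrix shown above.
rows = [[entry(i, j) for j in range(num_cols)] for i in range(num_rows)]
print(rows)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Walking down a column moves one step in the flat array, while walking along a row jumps by `num_rows` positions.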

The base class of local matrices is `Matrix`, and we provide two implementations: `DenseMatrix` and `SparseMatrix`. We recommend using the factory methods implemented in `Matrices` to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the `Matrix` Scala docs and `Matrices` Scala docs for details on the API.

```
Int.MaxValue // note the largest value an index can take
```

```
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
```

Next, let us create the following sparse local matrix: \(\begin{pmatrix} 9.0 & 0.0 \\ 0.0 & 8.0 \\ 0.0 & 6.0 \end{pmatrix}\)

```
// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
// Arguments: numRows, numCols, column pointers, row indices, non-zero values (CSC format)
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
```
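To see how the three arrays passed to `Matrices.sparse` encode the matrix, here is a rough sketch in plain Python (no Spark required). The function `csc_to_rows` is a hypothetical helper written for this illustration, not an MLlib method. It assumes the usual CSC convention: the non-zero entries of column `j` occupy positions `col_ptrs[j]` up to, but not including, `col_ptrs[j + 1]` in the row-index and value arrays.

```python
# Minimal sketch of decoding Compressed Sparse Column (CSC) arrays.
# csc_to_rows is a hypothetical helper, not part of MLlib.
def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand CSC arrays into a dense list of rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries of column j live in the slice col_ptrs[j]:col_ptrs[j + 1].
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            rows[row_indices[k]][j] = float(values[k])
    return rows


# Same arrays as in the sparse example above.
decoded = csc_to_rows(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(decoded)  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

Column 0 owns one entry (value 9 in row 0) and column 1 owns two (value 6 in row 2 and value 8 in row 1), which reproduces the matrix given above.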

## Local Matrix in Python

The base class of local matrices is `Matrix`, and we provide two implementations: `DenseMatrix` and `SparseMatrix`. We recommend using the factory methods implemented in `Matrices` to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the `Matrix` Python docs and `Matrices` Python docs for more details on the API.

```
%py
from pyspark.mllib.linalg import Matrix, Matrices
# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)); values are listed in column-major order
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])
dm2
```
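Because `Matrices.dense` reads its values in column-major order, a matrix written out row by row has to be reordered before it is passed in. The following is a minimal sketch in plain Python; `to_column_major` is a hypothetical helper introduced here for illustration, not a PySpark function.

```python
# Minimal sketch: flatten a row-wise matrix into column-major order.
# to_column_major is a hypothetical helper, not part of PySpark.
def to_column_major(rows):
    """Return the entries of a list-of-rows matrix, column by column."""
    num_rows, num_cols = len(rows), len(rows[0])
    return [rows[i][j] for j in range(num_cols) for i in range(num_rows)]


flat = to_column_major([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(flat)  # [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
```

The resulting list is in the order that the third argument of `Matrices.dense(3, 2, ...)` expects.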

```
%py
# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
sm
```
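Going the other way, the three CSC arrays can be derived from a row-wise matrix by scanning it one column at a time. This is only a sketch in plain Python under that assumption; `rows_to_csc` is a hypothetical helper, not a PySpark function.

```python
# Minimal sketch: build CSC arrays (column pointers, row indices, values)
# from a list-of-rows matrix. rows_to_csc is a hypothetical helper.
def rows_to_csc(rows):
    num_rows, num_cols = len(rows), len(rows[0])
    col_ptrs, row_indices, values = [0], [], []
    for j in range(num_cols):
        for i in range(num_rows):
            if rows[i][j] != 0:
                row_indices.append(i)
                values.append(rows[i][j])
        # After finishing column j, record where the next column starts.
        col_ptrs.append(len(values))
    return col_ptrs, row_indices, values


col_ptrs, row_indices, values = rows_to_csc([[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]])
print(col_ptrs, row_indices, values)  # [0, 1, 3] [0, 1, 2] [9.0, 8.0, 6.0]
```

The scan lists the entries of column 1 in increasing row order (8 in row 1, then 6 in row 2), whereas the cell above lists them as 6 then 8. Both orderings describe the same matrix, since each value is paired with its own row index.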