// Databricks notebook source exported at Sun, 19 Jun 2016 03:04:17 UTC

# Scalable Data Science

### prepared by Raazesh Sainudiin and Sivanand Sivaram

This is an elaboration of the Apache Spark 1.6 mllib-programming-guide on mllib-data-types.

# Overview

## Data Types - MLlib Programming Guide

- Local vector
- Labeled point
- Local matrix
- Distributed matrix
- RowMatrix
- IndexedRowMatrix
- CoordinateMatrix
- BlockMatrix

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.

### RowMatrix in Scala

A `RowMatrix` is a row-oriented distributed matrix without meaningful
row indices, backed by an RDD of its rows, where each row is a local
vector. Since each row is represented by a local vector, **the number of
columns is limited by the integer range but it should be much smaller in
practice**.
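One way to see why a `RowMatrix` can do without meaningful row indices is that quantities such as the Gram matrix A^T A are sums over rows, so they do not depend on the order in which the rows arrive. The following is a minimal pure-Python sketch of that idea, not Spark code: plain lists stand in for local vectors, and the helper name `gramian` is hypothetical.

```python
# Pure-Python sketch (no Spark): each inner list stands in for one local vector (one row).

def gramian(rows):
    """Return A^T A, accumulated as a sum of outer products, one per row."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                g[i][j] += row[i] * row[j]
    return g

rows = [
    [12.0, -51.0, 4.0],
    [6.0, 167.0, -68.0],
    [-4.0, 24.0, -41.0],
]
shuffled = [rows[2], rows[0], rows[1]]  # same rows, different order

gram_original = gramian(rows)
gram_shuffled = gramian(shuffled)

# The two results agree because the sum over rows is order-independent.
print(gram_original == gram_shuffled)
```

Because every row contributes independently, a computation like this can be distributed over the partitions of an RDD and combined at the end.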

A `RowMatrix` can be created from an `RDD[Vector]` instance. Then we can compute its
column summary statistics and decompositions.

- QR decomposition is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.
- For singular value decomposition (SVD) and principal component analysis (PCA), please refer to Dimensionality reduction.
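To make the form A = QR concrete, here is a small pure-Python sketch that factors a 3 x 3 matrix with classical Gram-Schmidt. This is only an illustration of the decomposition itself: it is not how Spark computes it, the helper name `gram_schmidt_qr` is hypothetical, and it assumes the columns of A are linearly independent. A different QR algorithm may return an R whose rows differ in sign from the one computed here.

```python
import math

def gram_schmidt_qr(a):
    """Classical Gram-Schmidt QR of a matrix given as a list of rows.

    Assumes the columns of `a` are linearly independent.
    Returns (q, r) with q having orthonormal columns and r upper triangular.
    """
    m, n = len(a), len(a[0])
    q_cols = []                                # orthonormal columns of Q
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(m)]        # start from the j-th column of A
        for k, q_k in enumerate(q_cols):
            # Projection coefficient of column j onto the k-th orthonormal column.
            r[k][j] = sum(q_k[i] * a[i][j] for i in range(m))
            v = [v[i] - r[k][j] * q_k[i] for i in range(m)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]  # back to row-major
    return q, r

a = [
    [12.0, -51.0, 4.0],
    [6.0, 167.0, -68.0],
    [-4.0, 24.0, -41.0],
]
q, r = gram_schmidt_qr(a)

# Multiply Q and R back together and measure how far the product is from A.
qr_product = [[sum(q[i][k] * r[k][j] for k in range(3)) for j in range(3)]
              for i in range(3)]
max_error = max(abs(qr_product[i][j] - a[i][j]) for i in range(3) for j in range(3))

for row in r:
    print(row)
print(max_error)
```

For this matrix, R works out to rows (14, 21, -14), (0, 175, -70) and (0, 0, 35) up to floating-point rounding.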

Refer to the `RowMatrix` Scala docs for details on the API.

```
import org.apache.spark.rdd.RDD // needed for the RDD[Vector] type annotation below
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
```

```
// an RDD of local vectors
val rows: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(12.0, -51.0, 4.0),
  Vectors.dense(6.0, 167.0, -68.0),
  Vectors.dense(-4.0, 24.0, -41.0)))
```

```
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
```

```
mat.rows.collect
```

```
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
```
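Besides its size, we can ask for column summary statistics; in the Scala API this is `computeColumnSummaryStatistics()` on a `RowMatrix`. As a rough illustration of what such a summary contains, the pure-Python sketch below computes the per-column mean, variance and non-zero count for the same three rows. The helper name `column_summary` is hypothetical, and the use of the sample variance (dividing by m - 1) is an assumption about the convention rather than something stated in this notebook.

```python
# Pure-Python sketch (no Spark): each inner list stands in for one row of the matrix.

def column_summary(rows):
    """Per-column mean, sample variance and non-zero count for a list of rows."""
    m, n = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / float(m) for j in range(n)]
    # Assumption: sample variance, i.e. divide by (m - 1); requires m >= 2.
    variance = [sum((row[j] - mean[j]) ** 2 for row in rows) / float(m - 1)
                for j in range(n)]
    num_nonzeros = [sum(1 for row in rows if row[j] != 0.0) for j in range(n)]
    return {"mean": mean, "variance": variance, "num_nonzeros": num_nonzeros}

rows = [
    [12.0, -51.0, 4.0],
    [6.0, 167.0, -68.0],
    [-4.0, 24.0, -41.0],
]
summary = column_summary(rows)
print(summary["mean"])
print(summary["variance"])
print(summary["num_nonzeros"])
```

Like the size, these statistics depend only on the set of rows and not on their order, which fits a matrix type without row indices.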

```
// QR decomposition
val qrResult = mat.tallSkinnyQR(true)
```

```
qrResult.R
```

### RowMatrix in Python

A `RowMatrix` can be created from an `RDD` of vectors.

Refer to the `RowMatrix` Python docs for more details on the API.

```
%py
from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)
# Get its size.
m = mat.numRows() # 4
n = mat.numCols() # 3
print("%d x %d" % (m, n))  # same output under Python 2 and Python 3
# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
```