%md
This is an elaboration of the [Apache Spark mllib-programming-guide on mllib-data-types](http://spark.apache.org/docs/latest/mllib-data-types.html).
# [Overview](/#workspace/scalable-data-science/xtraResources/ProgGuides2_2/MLlibProgrammingGuide/000_MLlibProgGuide)
## [Data Types - MLlib Programming Guide](/#workspace/scalable-data-science/xtraResources/ProgGuides2_2/MLlibProgrammingGuide/dataTypes/000_dataTypesProgGuide)
- [Local vector](http://spark.apache.org/docs/latest/mllib-data-types.html#local-vector)
- [Labeled point](http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point)
- [Local matrix](http://spark.apache.org/docs/latest/mllib-data-types.html#local-matrix)
- [Distributed matrix](http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix)
- [RowMatrix](http://spark.apache.org/docs/latest/mllib-data-types.html#rowmatrix)
- [IndexedRowMatrix](http://spark.apache.org/docs/latest/mllib-data-types.html#indexedrowmatrix)
- [CoordinateMatrix](http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix)
- [BlockMatrix](http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix)
MLlib supports local vectors and matrices stored on a single machine, as
well as distributed matrices backed by one or more RDDs. Local vectors
and local matrices are simple data models that serve as public
interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/). A
training example used in supervised learning is called a “labeled point”
in MLlib.
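
As a minimal sketch of the local types mentioned above, and assuming the `org.apache.spark.mllib.linalg` package is on the classpath (as it normally is in a Spark notebook), the following builds a dense vector, an equivalent sparse vector, and a small dense local matrix. The values are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}

// Dense vector (1.0, 0.0, 3.0): every entry is stored.
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// The same vector in sparse form: size 3, non-zeros at indices 0 and 2.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A 3 x 2 dense local matrix; entries are given in column-major order,
// so this is ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
```

Both vector forms hold the same entries; the sparse form only stores the non-zero ones, which matters when most features are zero.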
%md
Labeled point in Scala
-------------
A labeled point is a local vector, either dense or sparse, associated
with a label/response. In MLlib, labeled points are used in supervised
learning algorithms.
We use a double to store a label, so we can use
labeled points in both regression and classification.
For binary classification, a label should be either `0` (negative) or `1`
(positive). For multiclass classification, labels should be class
indices starting from zero: `0, 1, 2, ...`.
A labeled point is represented by the case class
[`LabeledPoint`](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
Refer to the [`LabeledPoint` Scala docs](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint)
for details on the API.
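
The description above can be sketched as follows, assuming a Spark environment with MLlib available. The feature values are illustrative only; `pos` and `neg` are names chosen for this example.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A positive example (label 1.0) with a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// A negative example (label 0.0) with a sparse feature vector
// of size 3 whose non-zero entries sit at indices 0 and 2.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```

Because the label is a `Double`, the same class serves regression (any real value) and classification (class indices such as `0.0`, `1.0`, `2.0`).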
%md
***Sparse data in Scala***
It is very common in practice to have sparse training data. MLlib
supports reading training examples stored in `LIBSVM` format, which is
the default format used by
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a
text format in which each line represents a labeled sparse feature
vector using the following format:
`label index1:value1 index2:value2 ...`
where the indices are one-based and in ascending order. After loading,
the feature indices are converted to zero-based.
[`MLUtils.loadLibSVMFile`](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.MLUtils$)
reads training examples stored in LIBSVM format.
Refer to the [`MLUtils` Scala docs](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.MLUtils)
for details on the API.
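
To make the index convention concrete, here is a minimal pure-Scala sketch of how one line in this format maps to a label plus zero-based indices and values. The helper `parseLibSVMLine` is hypothetical and written only for illustration; it is not MLlib's implementation, and in practice `MLUtils.loadLibSVMFile` should be used.

```scala
// Hypothetical helper for illustration only (not part of MLlib).
// Parses "label index1:value1 index2:value2 ..." into
// (label, zero-based indices, values).
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val (indices, values) = tokens.tail.map { item =>
    val Array(i, v) = item.split(":")
    (i.toInt - 1, v.toDouble) // one-based in the file, zero-based after loading
  }.unzip
  (label, indices, values)
}

// Made-up input line: label 1, features 3 and 7 (one-based) are non-zero.
val (label, indices, values) = parseLibSVMLine("1 3:0.5 7:2.0")
```

The one-based indices `3` and `7` in the text become the zero-based indices `2` and `6` in memory, matching the conversion described above.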
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
// From the programming guide; the sample file is not available here, but it can be fetched from GitHub with wget.
// val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
%md
## Load MNIST training and test datasets
Our datasets are vectors of pixels representing images of handwritten digits. For example:


examples.take(1)
%md
Display our data. Each image has a true label (the `label` column) and a vector of `features` that represents pixel intensities (see below for details of what is in `examples`).
display(examples.toDF) // convert to DataFrame and display for convenient db visualization
%md
The pixel intensities are represented in `features` as a sparse vector. For example, the first observation, as seen in row 1 of the output of `display(examples.toDF)` above, has `label` equal to `5`, i.e. the hand-written image is of the digit 5. This image is stored as the following sparse vector (click the triangle to the left of the feature in the first row to see it):
```
type: 0
size: 780
indices: [152,153,155,...,682,683]
values: [3, 18, 18,18,126,...,132,16]
```
Here
* `type: 0` says we have a sparse vector.
* `size: 780` says the vector has 780 indices in total.
  * These indices `0,...,779` are a one-dimensional indexing of the two-dimensional array of pixels in the image.
* `indices: [152,153,155,...,682,683]` are those of the possible indices `[0,1,...,779]` that have non-zero values.
  * A value is an integer encoding the gray level at the pixel index.
* `values: [3, 18, 18,18,126,...,132,16]` are the actual gray-level values, for example:
  * at pixel index `152` the gray-level value is `3`,
  * at index `153` the gray-level value is `18`,
  * ..., and finally
  * at index `683` the gray-level value is `16`.
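
The pairing of `indices` with `values` can be sketched on a much smaller, made-up vector, assuming Spark MLlib is available. The vector `toy` below is hypothetical and is not taken from the MNIST data; it only reuses a few gray levels for flavour.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Toy sparse vector with 10 slots; only indices 2, 3 and 7 are stored.
val toy = Vectors.sparse(10, Array(2, 3, 7), Array(3.0, 18.0, 126.0))

// Looking up a stored index returns the paired value.
val atStored = toy(3)

// Looking up an index that is not stored returns 0.0 (a blank pixel).
val atMissing = toy(4)

// Expanding to a plain array fills every unstored slot with 0.0.
val asArray = toy.toArray
```

For an MNIST image most pixels are blank, so storing only the non-zero positions keeps each example far smaller than its 780-slot dense form.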
ScaDaMaLe Course site and book