ScaDaMaLe Course site and book

A more detailed deep dive

We will mostly be using ML algorithms already implemented in Spark's libraries and packages.

However, in order to innovate and devise new algorithms, we need to understand more about distributed linear algebra and optimisation algorithms.

  • read: http://arxiv.org/pdf/1509.02256.pdf (also see References and Appendix A).
  • and go through the notebooks prefixed with '19x' on the various data types for distributed linear algebra.

You may want to follow this course from Stanford for a guided deeper dive:

  • http://stanford.edu/~rezab/dao/
  • see notes here: https://github.com/lamastex/scalable-data-science/blob/master/read/daosu.pdf

Communication Hierarchy

In addition to the time and space complexity of an algorithm, in the distributed setting we also need to account for communication complexity.

Access rates fall sharply with distance.

  • there is roughly a 50x gap between reading from memory and reading from either disk or the network.

We must take this communication hierarchy into consideration when developing parallel and distributed algorithms, focusing on strategies to reduce communication costs.


Lessons:

  • parallelism makes our computation faster

  • but network communication slows us down

  • SOLUTION: perform parallel and in-memory computation.

  • Persisting in memory is a particularly attractive option when working with iterative algorithms that read the same data multiple times, as is the case in gradient descent (see the sketch after this list).

  • Several machine learning algorithms are iterative!

  • Limits of multi-core scaling (a single powerful multicore machine with several CPUs and a huge amount of RAM):

    • advantages:
      • we sidestep any network communication when working with a single multicore machine,
      • such machines can handle fairly large data sets and are an attractive option in many settings.
    • disadvantages:
      • they can be quite expensive (due to specialized hardware),
      • they are not as widely accessible as commodity computing nodes,
      • this approach has scalability limits: we eventually hit a wall when the data grows large enough! This is not the case for a distributed environment (like the AWS EC2 cloud under the hood here).
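
To make these lessons concrete, here is a minimal sketch (assuming Spark's RDD API and a hypothetical `Point` case class; this is not the course's official implementation) of batch gradient descent for least-squares linear regression. The large training data is cached in memory and never leaves the workers; only the d-dimensional weight vector and per-partition partial gradients cross the network at each iteration.

```scala
import org.apache.spark.rdd.RDD

// hypothetical representation of one labelled observation
case class Point(features: Array[Double], label: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def gradientDescent(data: RDD[Point], d: Int, numIters: Int, stepSize: Double): Array[Double] = {
  data.cache()                          // persist the large data in memory across iterations
  val n = data.count().toDouble
  var w = Array.fill(d)(0.0)            // current parameter vector, O(d) to communicate
  for (_ <- 1 to numIters) {
    // each worker computes partial gradients on its local partition;
    // only d-dimensional vectors are shipped back in the reduce step
    val grad = data
      .map { p =>
        val err = dot(w, p.features) - p.label
        p.features.map(_ * err)
      }
      .reduce((g1, g2) => g1.zip(g2).map { case (x, y) => x + y })
    w = w.zip(grad).map { case (wj, gj) => wj - stepSize * gj / n }
  }
  w
}
```

The same pattern covers the big n, big d data-parallel setting discussed below: the data is stored in a distributed fashion and only the O(d) parameter vector is communicated per iteration.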

Simple strategies for algorithms in a distributed setting: to reduce network communication, simply keep large objects local

  • In the big n, small d case (n is the sample size and d is the number of features) for linear regression:
    • we can solve the problem via a closed-form solution,
    • and this requires us to communicate only \(O(d^2)\) intermediate data (see the normal-equations sketch after this list).
    • the largest object in this example is our initial data, which we store in a distributed fashion and never communicate! This is a data parallel setting.
  • In the big n, big d case:
    • for linear regression,
      • we use gradient descent to iteratively train our model and are again in a data parallel setting (as in the gradient descent sketch above).
      • At each iteration we communicate the current parameter vector \(w_i\), and the required \(O(d)\) communication is feasible even for fairly large d.
  • In the small n, small d case:
    • for ridge regression,
      • we can communicate the small data to all of the workers,
      • this is an example of a model parallel setting where we can train the model for each hyper-parameter in parallel (see the model-parallel sketch after this list).
  • Linear regression with big n and huge d is an example of both data and model parallelism.
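
For the big n, small d case above, a minimal sketch of the closed-form route (assuming breeze for local linear algebra and the same hypothetical `Point` case class as in the earlier sketch) could look as follows: each worker accumulates its local contribution to \(X^T X\) and \(X^T y\), so only \(O(d^2)\) summary statistics, never the \(n \times d\) data itself, reach the driver.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

def normalEquations(data: RDD[Point], d: Int): DenseVector[Double] = {
  // accumulate (X^T X, X^T y) with a tree-shaped reduction: O(d^2) per message
  val (xtx, xty) = data.treeAggregate(
    (DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
    { case ((m, v), p) =>
        val x = DenseVector(p.features)
        (m + x * x.t, v + x * p.label)   // rank-one update from one local point
    },
    { case ((m1, v1), (m2, v2)) => (m1 + m2, v1 + v2) }
  )
  xtx \ xty                              // solve (X^T X) w = X^T y locally on the driver
}
```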
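
And for the small n, small d model-parallel case, a minimal sketch (again with breeze; the `fitRidge` helper and the lambda grid are illustrative assumptions, not library APIs) broadcasts the whole small dataset to every worker and fits one ridge model per regularization value in parallel:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.SparkContext

// closed-form ridge solution for one hyper-parameter: w = (X^T X + lambda I)^{-1} X^T y
def fitRidge(x: DenseMatrix[Double], y: DenseVector[Double], lambda: Double): DenseVector[Double] =
  (x.t * x + DenseMatrix.eye[Double](x.cols) * lambda) \ (x.t * y)

def ridgeGridSearch(sc: SparkContext,
                    x: DenseMatrix[Double], y: DenseVector[Double],
                    lambdas: Seq[Double]): Seq[(Double, DenseVector[Double])] = {
  val bx = sc.broadcast(x)      // the small data fits on every worker
  val by = sc.broadcast(y)
  sc.parallelize(lambdas)       // one task per hyper-parameter value: model parallelism
    .map(lambda => (lambda, fitRidge(bx.value, by.value, lambda)))
    .collect()
    .toSeq
}
```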

In this setting, since our data is large, we must still store it across multiple machines. We can still use gradient descent, or stochastic variants of gradient descent, to train our model, but we may not want to communicate the entire d-dimensional parameter vector at each iteration when we have tens or hundreds of millions of features. In this setting we often rely on sparsity to reduce the communication, as in the sketch below. So far we have discussed how to reduce communication by keeping large data local.
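
To make the last point concrete, here is a minimal sketch of one sparse gradient step (the `SparsePoint` case class and helper names are illustrative assumptions, not library APIs): because each feature vector is stored sparsely, a worker's gradient contribution only touches the coordinates that actually occur in its local points, and only those (index, value) pairs are communicated.

```scala
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.rdd.RDD

// hypothetical representation of one labelled observation with sparse features
case class SparsePoint(features: SparseVector, label: Double)

// partial gradient for one point: non-zero only on the point's active indices
def pointGradient(w: Map[Int, Double], p: SparsePoint): Seq[(Int, Double)] = {
  val active = p.features.indices.zip(p.features.values)
  val err = active.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum - p.label
  active.map { case (i, v) => i -> err * v }.toSeq
}

// one sparse gradient step: only the touched coordinates cross the network
// (in practice w itself would also be broadcast or stored in a distributed fashion)
def sparseStep(batch: RDD[SparsePoint], w: Map[Int, Double], stepSize: Double): Map[Int, Double] = {
  val grad = batch
    .flatMap(p => pointGradient(w, p))
    .reduceByKey(_ + _)                // aggregate per-coordinate contributions
    .collectAsMap()
  grad.foldLeft(w) { case (acc, (i, g)) =>
    acc.updated(i, acc.getOrElse(i, 0.0) - stepSize * g)
  }
}
```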