Infrastructure

Distributed Computing Requirements

Storing and processing big data requires computing infrastructure. Many real-world datasets are too large to be analysed on a laptop or a single desktop computer, so we typically need a cluster of several computers. Such clusters are hosted in a public or private cloud, i.e. on computers that are housed elsewhere but accessible remotely.
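As a rough illustration of why a single machine is often not enough, the following minimal sketch estimates how many machines would be needed for their combined memory to hold a dataset. The function name `machines_needed` and all sizes are hypothetical and chosen only for illustration. The estimate is deliberately crude: it ignores disk spilling, replication and processing overhead, so treat it as a back-of-the-envelope calculation rather than a sizing rule.

```python
import math


def machines_needed(dataset_gb: float, usable_ram_gb: float) -> int:
    """Smallest number of machines whose combined usable memory holds the dataset.

    Hypothetical helper for illustration; real cluster sizing depends on
    many more factors than memory alone.
    """
    if dataset_gb < 0 or usable_ram_gb <= 0:
        raise ValueError("dataset size must be >= 0 and usable RAM must be > 0")
    # Always count at least one machine, even for a tiny dataset.
    return max(1, math.ceil(dataset_gb / usable_ram_gb))


# Assumed figures: a machine with 16 GB of usable memory and
# three datasets of increasing size (in GB).
usable_ram_gb = 16
for dataset_gb in (4, 500, 2000):
    n = machines_needed(dataset_gb, usable_ram_gb)
    print(f"{dataset_gb} GB dataset: about {n} machine(s) of {usable_ram_gb} GB each")
```

Under these assumed figures, only the smallest dataset fits on one machine; the larger ones call for a cluster.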

Note: We can learn how to use Apache Spark on or from our laptop and even analyse big data sets from our laptop by running the jobs in the cloud.

databricks - managed clusters in the cloud

One of the easiest ways of doing this is through a fully managed Spark cluster on databricks:

Getting your laptop ready

Thus, to work with big data we first need to:

Advanced Topics in self-managed clusters

Some more advanced topics on infrastructure follow; most readers can skip them.

The following are advanced topics in self-managed cloud computing (they are optional for the SDS-2.2 course). One typically uses a powerful laptop to develop and deploy such infrastructures in one of three environments: