Big data

Big data is data that’s too large to be stored or processed by a single machine: datasets where no single disk can hold the bytes, no single memory can hold a working slice, or no single CPU can finish the computation in a reasonable time. At that scale, single-machine techniques (whether files like HDF5 or a server-class Relational database) stop working, and we switch to distributed storage and computation.

The idea is straightforward. Instead of one machine, we use many. We connect them over a network and treat them as a single logical system. Each individual machine is a node, physical or virtual; the distinction doesn’t matter from outside. A group of nodes connected to cooperate as one machine is a cluster.

A cluster lets us do two things we couldn’t do before:

Store more data than fits on any one node, by splitting the dataset into pieces and putting different pieces on different nodes.
Do more computation than any one node can finish, by handing pieces of the computation to different nodes and combining the results.

This sounds easy in two sentences. In practice it raises a lot of questions. How is the data split? Who keeps track of where each piece lives? What happens when a node fails, and at sufficient scale some node is always failing? How do we coordinate the work? How do we get the pieces back together at the end? These questions are what big-data frameworks exist to answer.

A standard industry taxonomy describes big data by the 3Vs, introduced by analyst Doug Laney in 2001: Volume (raw size), Velocity (rate of arrival, e.g. log streams, sensor feeds, social media firehose), and Variety (mix of structured tables, semi-structured JSON, unstructured text, images, audio). Later expansions add Veracity (uncertainty about correctness) for a 4V framing and Value (whether anything useful comes out) for the 5V framing. The Vs are a marketing taxonomy more than a rigorous definition: useful as a shared vocabulary, not as a checklist a dataset has to satisfy.

The standard framework is Apache Hadoop, with three core components: HDFS (distributed storage), MapReduce (a programming model for distributed computation), and YARN (resource management across the cluster). Modern systems often use Hadoop for storage and successor frameworks like Spark or Dask for computation: same shape, faster iterative algorithms.

Idriss Rami — Notes

Explorer

Big data

Graph View

Backlinks