HDF5 (Hierarchical Data Format, version 5) is a widely used file format for scientific data that doesn’t fit neatly into a spreadsheet: large numerical arrays, multi-modal recordings, simulation outputs, sensor archives. The name is descriptive: it’s hierarchical (data is organized in a tree, like a filesystem) and it’s a data format (a way of laying bytes on disk). In practice, "HDF5" also refers to the library and toolset built around that format, not just the specification.

Several properties make HDF5 the right answer for large scientific datasets:

  • Heterogeneous: a single HDF5 file can contain numerical arrays, text, images, tables, and metadata, all under one roof.
  • Handles large and complex data: HDF5 files can grow to terabytes, and because datasets can be read and written in pieces rather than loaded whole, access stays efficient at that size.
  • Operating-system independent: a file written on Linux can be opened on Mac or Windows without conversion.
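  • Open and language-agnostic: the format and reference library are open source. The HDF Group maintains the C, C++, Fortran, and Java APIs, and bindings or built-in support exist for Python (h5py), MATLAB, R, and many other languages.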
  • Self-describing: the metadata describing what’s in the file (arrays, shapes, data types) lives inside the file itself. No need to ship separate documentation.
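  • Partial reading (data slicing): if a file contains a 100-GB array and you want a 100-MB slice, HDF5 can read just that region off disk without loading the rest. How close the I/O comes to the size of the slice depends on the storage layout. With a chunked dataset, HDF5 reads the chunks that overlap the slice. With a contiguous (unchunked) dataset, a slice along the axis that is contiguous on disk is cheap, but a slice across the slow axis breaks into many small scattered reads and can touch far more of the file than the slice itself.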
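
To make the partial-reading point concrete, here is a minimal sketch using h5py. The file name, the dataset name "big", and the 100 x 100 chunk shape are assumptions picked for illustration; a real chunk shape should match the access pattern of the data.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "slicing_demo.h5")

with h5py.File(path, "w") as hdf:
    data = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)
    # chunks=(100, 100) stores the array on disk as a grid of 100 x 100 tiles.
    hdf.create_dataset("big", data=data, chunks=(100, 100))

with h5py.File(path, "r") as hdf:
    dset = hdf["big"]                # a handle only; no array data is read yet
    block = dset[200:300, 500:600]   # reads the chunk(s) overlapping this region
    block_shape = block.shape
    first_value = float(block[0, 0])
    chunk_shape = dset.chunks
```

Indexing the dataset handle returns an ordinary NumPy array holding only the requested region, so the rest of the array never has to fit in memory.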

Mentally, an HDF5 file looks like a Linux filesystem. There’s a root group, written /, that contains everything else. Inside the root group there can be groups (analogous to directories) and datasets (analogous to files), and groups can contain other groups recursively. A dataset at path /MyImages/1.jpg lives inside a group called MyImages that lives inside the root.
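
The filesystem analogy can be sketched with h5py as follows. The group name MyImages and the dataset name 1.jpg come from the example above; the Sensors/Run1/temperature path and the placeholder arrays are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.h5")

with h5py.File(path, "w") as hdf:
    # A group plays the role of a directory.
    images = hdf.create_group("MyImages")
    # A dataset plays the role of a file (placeholder pixel data here).
    images.create_dataset("1.jpg", data=np.zeros((4, 4), dtype="uint8"))
    # Missing intermediate groups in a path are created automatically.
    hdf.create_dataset("Sensors/Run1/temperature", data=np.arange(10))

with h5py.File(path, "r") as hdf:
    top_level = sorted(hdf.keys())         # direct children of the root group
    all_paths = []
    hdf.visit(all_paths.append)            # walk the whole tree recursively
    image_name = hdf["/MyImages/1.jpg"].name
```

Here keys() lists only the direct children of a group, much like ls on a directory, while visit() walks every group and dataset beneath it.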

We work with HDF5 in Python through h5py, which lets us treat an open file as if it were a Python dictionary with group and dataset names as keys.

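import h5py
import numpy as np

# 1000 x 1000 array of random floats in [0, 1)
matrix_1 = np.random.random((1000, 1000))
# 'w' creates the file, overwriting any existing file at that path
with h5py.File('./hdf5_data.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=matrix_1)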

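The with block opens the file, the create_dataset call writes a NumPy array under the name dataset1, and the file closes automatically when the block ends. Reading is symmetric: open in 'r' mode, pull a dataset out by name, and wrap it in np.array(...) to materialize the bytes into memory. The copy has to happen while the file is still open, because the dataset object is only a handle into the file and stops working once the file closes.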
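
A minimal sketch of the read side, under the assumption that the file is written first so the example runs on its own. The temporary path is a stand-in for './hdf5_data.h5'.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary path standing in for './hdf5_data.h5'.
path = os.path.join(tempfile.mkdtemp(), "hdf5_data.h5")

matrix_1 = np.random.random((1000, 1000))
with h5py.File(path, "w") as hdf:
    hdf.create_dataset("dataset1", data=matrix_1)

with h5py.File(path, "r") as hdf:
    names = list(hdf.keys())        # dictionary-style listing of the root group
    dset = hdf["dataset1"]          # dictionary-style lookup; returns a handle
    shape, dtype = dset.shape, dset.dtype  # metadata, read without loading data
    loaded = np.array(dset)         # copy the full array into memory

# 'loaded' is an ordinary NumPy array and stays valid after the file closes.
same = np.array_equal(loaded, matrix_1)
```

Shape and dtype come from the file's own metadata, so they can be inspected before deciding whether to load the whole array or only a slice.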

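HDF5 supports per-dataset compression, applied chunk by chunk. Through h5py the usual choices are gzip (lossless, available in every HDF5 installation, safe default), lzf (lossless, faster, less compressed, bundled with h5py, so other HDF5 tools may need the filter installed to read it), or szip (lossless, tuned for correlated scientific data, patent-encumbered and therefore not included in every HDF5 build).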
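
As a rough illustration of per-dataset compression in h5py, the sketch below writes the same array three ways and compares on-disk sizes. The dataset names, the gzip level of 4, and the all-zeros test array are assumptions for the example; an all-zeros array compresses far better than real data would.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "compression_demo.h5")
data = np.zeros((1000, 1000))  # deliberately very compressible

with h5py.File(path, "w") as hdf:
    hdf.create_dataset("raw", data=data)  # no compression
    # compression_opts is the gzip level (0-9); 4 is h5py's default.
    hdf.create_dataset("gz", data=data, compression="gzip", compression_opts=4)
    hdf.create_dataset("lz", data=data, compression="lzf")

with h5py.File(path, "r") as hdf:
    gz = hdf["gz"]
    gz_filter = gz.compression          # name of the filter applied
    gz_level = gz.compression_opts      # gzip level stored with the dataset
    gz_chunks = gz.chunks               # compression implies chunked storage
    lz_filter = hdf["lz"].compression
    raw_bytes = hdf["raw"].id.get_storage_size()  # bytes used on disk
    gz_bytes = gz.id.get_storage_size()
    roundtrip_ok = np.array_equal(gz[...], data)  # decompression is transparent
```

Reading a compressed dataset needs no extra code: the filter is recorded in the file and applied automatically, which is why the round-trip comparison uses plain indexing.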