An HDF5 dataset is a leaf of the HDF5 hierarchy: the actual stored data. Conceptually it is a multi-dimensional array whose elements all share one type (most often numeric, though strings and compound types are also supported), with a name, a shape, a dtype, and optional metadata attached as attributes. Datasets play the role that files play in a Linux filesystem; groups play the role of directories.

We create a dataset by calling create_dataset on the file object or on a group:

with h5py.File('./hdf5_data.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=matrix_1)

This writes the NumPy array matrix_1 into the file under the name dataset1. The dataset stores not just the data but also its shape (here (1000, 1000)) and its dtype (say float64, which h5py reports in NumPy’s notation as <f8: little-endian, 8-byte floating point).
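As a minimal sketch of the points above, the following creates a dataset inside a group, attaches one attribute, and reads back the name, shape, and dtype. The array contents, the group name experiment, the attribute units, and the temporary file path are all assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for matrix_1: a 1000 x 1000 array of float64 values.
matrix_1 = np.random.random((1000, 1000))

# Write to a temporary directory so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "hdf5_data.h5")

with h5py.File(path, "w") as hdf:
    group = hdf.create_group("experiment")            # hypothetical group name
    dset = group.create_dataset("dataset1", data=matrix_1)
    dset.attrs["units"] = "volts"                     # hypothetical attribute

    ds_name = dset.name      # full path of the dataset inside the file
    ds_shape = dset.shape    # taken from the array that was written
    ds_dtype = dset.dtype    # NumPy dtype describing the stored type
    ds_units = dset.attrs["units"]

print(ds_name, ds_shape, ds_dtype, ds_units)
```

The dataset's full name includes its group, much like a file path includes its directory.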

When we read a dataset back, h5py returns a wrapper object — an h5py.Dataset instance — that represents the on-disk dataset without loading it into memory yet:

with h5py.File('./hdf5_data.h5', 'r') as hdf:
    dataset1 = hdf.get('dataset1')
    print(type(dataset1))            # <class 'h5py._hl.dataset.Dataset'> — type-check against h5py.Dataset
    my_array = np.array(dataset1)    # forces the bytes to be read off disk
    print(type(my_array))            # numpy.ndarray

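Wrapping the handle in np.array(...) materializes the whole dataset in memory. This two-step pattern (open a handle, then materialize only when we actually need the values) is what makes partial reads possible: if the dataset is a 100-GB array and we only need a 100-MB slice, we can slice the handle (dataset1[1000:2000]) and h5py reads only the data needed for that region, returning it as a NumPy array. The handle is valid only while the file is open, so any slicing has to happen inside the with block.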
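The partial-read pattern can be sketched as follows. The array here is a small stand-in for a genuinely large dataset, and the file name and shape are assumptions chosen only to keep the example quick to run.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "partial_read.h5")

# Small stand-in for a large array: 5000 rows of 10 float64 values.
with h5py.File(path, "w") as hdf:
    values = np.arange(50_000, dtype="f8").reshape(5000, 10)
    hdf.create_dataset("dataset1", data=values)

with h5py.File(path, "r") as hdf:
    dataset1 = hdf["dataset1"]     # handle only; no array data is read yet
    rows = dataset1[1000:2000]     # reads just these rows into a NumPy array
    everything = dataset1[...]     # reads the whole dataset

# Both results are ordinary in-memory arrays and outlive the closed file.
print(type(rows), rows.shape, everything.shape)
```

Slicing with [...] is an alternative to np.array(dataset1) for reading the full dataset; both return a NumPy array.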

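Datasets can also be created with compression by passing compression='gzip', 'lzf', or 'szip' to create_dataset, trading write and read speed for disk space. gzip is available in every HDF5 build and lzf ships with h5py, while szip depends on how the underlying HDF5 library was built.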
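A minimal sketch of the trade-off, assuming gzip at level 4 and deliberately repetitive data so that the effect is easy to see. The dataset names raw and packed are hypothetical, and real data will usually compress far less than an array of zeros.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "compressed.h5")

# Highly repetitive data, chosen so that compression has something to remove.
data = np.zeros((1000, 1000), dtype="f8")

with h5py.File(path, "w") as hdf:
    hdf.create_dataset("raw", data=data)
    hdf.create_dataset("packed", data=data,
                       compression="gzip", compression_opts=4)

# Reopen the file so that everything has been flushed to disk.
with h5py.File(path, "r") as hdf:
    packed = hdf["packed"]
    packed_filter = packed.compression     # name of the filter in use
    packed_chunks = packed.chunks          # compressed datasets are chunked
    # Bytes allocated on disk for each dataset's array data.
    raw_bytes = hdf["raw"].id.get_storage_size()
    packed_bytes = packed.id.get_storage_size()
    restored = packed[...]                 # decompression happens on read

print(packed_filter, packed_chunks, raw_bytes, packed_bytes)
```

Reading a compressed dataset needs no extra arguments: the filter is recorded in the file and applied automatically.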