Computer performance is most commonly measured by how quickly a program executes. Multiple factors influence this: the instruction set, the hardware design, the technology used to fabricate the chip, the operating system, and (for high-level code) the compiler.
This note focuses on the architectural levers — what hardware/architecture choices improve performance — rather than benchmarking methodology.
Technology
Historically, the single most effective lever has been smaller transistors.
The speed at which logic gates switch between logic states depends largely on transistor size. Smaller transistors:
- Switch faster (less capacitance to charge/discharge).
- Pack more densely, allowing more complex circuits per chip.
- Use less power per operation.
Decades of fabrication advances (driven by Very Large-Scale Integration, or VLSI, technology) have steadily shrunk transistors. See Moore’s law for the doubling rate that has driven computing performance since the 1960s.
The shrinking is reaching physical limits: at 5nm and below, quantum tunneling and atomic-scale variation start to matter. Continued performance gains increasingly come from architectural innovations rather than pure shrinking.
Parallelism
Performance can be improved by performing multiple operations in parallel. Parallelism appears at several levels.
Instruction-level parallelism
The simplest execution model finishes one instruction before starting the next, which is slow. Pipelining overlaps execution: while instruction i is being executed, instruction i+1 is being decoded, and instruction i+2 is being fetched. See Instruction execution cycle for the basic 5-stage cycle.
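The benefit of overlapping stages can be estimated with a little arithmetic. The sketch below is an idealized model, assuming every stage takes one cycle and the pipeline never stalls; the function names are illustrative, not taken from any library.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle (assumes no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (needs n_instructions >= 1)."""
    return cycles_unpipelined(n_instructions, n_stages) / cycles_pipelined(
        n_instructions, n_stages
    )
```

For 100 instructions on a 5-stage pipeline this gives 500 cycles unpipelined versus 104 pipelined. As the instruction count grows, the speedup approaches the number of stages; real pipelines fall short of that because of hazards and stalls, which this model ignores.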
Superscalar processors dispatch multiple instructions per cycle to multiple functional units. Modern CPUs can issue 4–8 instructions per cycle, limited by both program-inherent constraints (data dependencies between instructions) and hardware constraints (number of execution ports, register-rename resources, decode width, branch-prediction accuracy, cache miss stalls). On real workloads, achieved instructions-per-cycle is usually well below the issue width because of these combined limits.
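To see why data dependencies keep achieved instructions-per-cycle below the issue width, here is a toy issue model. It is a deliberately simplified sketch: every instruction has one-cycle latency, instructions may issue out of program order, and the only hardware limit is a hypothetical `width` parameter. Real processors have many more constraints, as listed above.

```python
def cycles_to_issue(deps: list[set[int]], width: int) -> int:
    """Count cycles needed to issue every instruction.

    deps[i] holds the indices of instructions whose results instruction i needs.
    At most `width` ready instructions issue per cycle.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    completed: set[int] = set()          # instructions issued in earlier cycles
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        cycles += 1
        issued = []
        for i in pending:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if deps[i] <= completed:
                issued.append(i)
        if not issued:
            raise ValueError("dependency can never be satisfied")
        completed.update(issued)
        pending = [i for i in pending if i not in completed]
    return cycles


def achieved_ipc(deps: list[set[int]], width: int) -> float:
    """Instructions per cycle actually achieved under this toy model."""
    cycles = cycles_to_issue(deps, width)
    return len(deps) / cycles if cycles else 0.0
```

With a width of 4, eight fully independent instructions issue in 2 cycles (IPC 4.0), while an eight-instruction chain in which each instruction depends on the previous one takes 8 cycles (IPC 1.0), no matter how wide the machine is.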
Multi-core processors
Modern processors contain multiple cores on a single chip, each a complete processor. Examples: dual-core, quad-core, and octa-core chips, and server CPUs with 64+ cores.
Each core can run an independent instruction stream. So a program with multiple threads (parallel computation) can use multiple cores simultaneously. A single-threaded program only uses one core regardless of how many are available.
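How much extra cores help depends on how much of the program can actually run in parallel. One common way to estimate this is Amdahl's law, which is not part of the note itself but fits the point being made. The sketch below assumes the parallel part scales perfectly across cores, which real programs rarely achieve.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Estimated speedup when `parallel_fraction` of the run time can be
    spread evenly over `n_cores` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)
```

A single-threaded program has a parallel fraction of 0, so the estimate stays at 1.0 for any core count. A program that is 90% parallel gets roughly a 4.7x speedup on 8 cores, not 8x, because the serial 10% does not shrink.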
Multiprocessors
Some systems contain multiple physical CPU chips, each with multiple cores. Shared-memory multiprocessors let all processors access the same main memory, sharing data and synchronizing via memory operations.
Pros: easy programming model (any processor can access any memory).
Cons: contention for memory bandwidth, complex cache coherence, scaling limits.
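The shared-memory programming model can be illustrated with threads inside one Python process. This is only a sketch of the model: Python threads stand in for processors, and a lock stands in for hardware synchronization. It does not demonstrate a real speedup, since CPython's global interpreter lock generally keeps such threads from running Python code on several cores at once.

```python
import threading


def shared_counter_demo(n_threads: int = 4, increments: int = 10_000) -> int:
    """Several threads update one value that all of them can see."""
    counter = {"value": 0}           # shared state, visible to every thread
    lock = threading.Lock()          # synchronization through shared memory

    def worker() -> None:
        for _ in range(increments):
            with lock:               # without the lock, updates could be lost
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Every worker reads and writes the same counter directly, which is what makes the model easy to program. The lock is also a small picture of the contention cost: all workers queue up on the same piece of memory.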
Multicomputers (distributed systems)
When you connect multiple complete computers over a network, each with its own private memory, you get a multicomputer (also called a cluster). Communication uses message passing rather than shared memory.
Pros: scales to thousands or millions of nodes (datacenters, supercomputers).
Cons: programmer must explicitly handle communication, much higher latency than shared memory.
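Message passing can be sketched in the same style. Here each hypothetical node owns a private slice of the data and communicates only by putting a message on a queue. Threads and `queue.Queue` stand in for separate machines and a network, so this shows the programming model, not real distribution or its latency.

```python
import queue
import threading


def node(private_data: list[int], outbox: queue.Queue) -> None:
    """A node works only on its own data and sends back a single message."""
    outbox.put(sum(private_data))


def cluster_sum(partitions: list[list[int]]) -> int:
    """Coordinator: start one node per partition, then combine their messages."""
    inbox: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=node, args=(part, inbox)) for part in partitions
    ]
    for w in workers:
        w.start()
    # Explicit communication: one receive per node, in whatever order they arrive.
    total = sum(inbox.get() for _ in partitions)
    for w in workers:
        w.join()
    return total
```

Unlike the shared-memory case, no node ever touches another node's data. The programmer has to decide what to send, when, and to whom, which is the cost named in the cons above.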
What limits performance
Three things constrain how fast a computer can run a program:
- Computation rate: how fast the cores execute instructions. Limited by clock speed, ILP, and cache hit rate.
- Memory bandwidth: how fast data flows between memory and CPU. Limited by bus width and DRAM speed.
- Memory latency: how long it takes to fetch a value from memory. Hidden by caches when access patterns have locality.
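The interplay between computation rate and memory bandwidth is often summarized with a roofline-style bound: a program cannot run faster than the cores can compute, nor faster than memory can feed them. The sketch below uses that common simplification, which is not spelled out in this note; the parameter names and units are assumptions chosen for illustration.

```python
def attainable_gflops(
    peak_gflops: float,
    bandwidth_gb_per_s: float,
    flops_per_byte: float,
) -> float:
    """Upper bound on performance: the lower of the compute limit and the
    memory limit (bandwidth times arithmetic done per byte moved)."""
    memory_bound = bandwidth_gb_per_s * flops_per_byte
    return min(peak_gflops, memory_bound)


def is_memory_bound(
    peak_gflops: float,
    bandwidth_gb_per_s: float,
    flops_per_byte: float,
) -> bool:
    """True when memory bandwidth, not the cores, sets the limit."""
    return bandwidth_gb_per_s * flops_per_byte < peak_gflops
```

With a hypothetical 1000 GFLOP/s peak and 100 GB/s of bandwidth, code that does 2 operations per byte is capped at 200 GFLOP/s, while code that does 20 operations per byte can reach the full 1000.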
The gap between modern CPU and memory speeds means memory often dominates. A cache miss that goes all the way to main memory can cost hundreds of cycles, which is why caches are so important.
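The effect of misses can be put in numbers with the standard average memory access time formula: hit time plus miss rate times miss penalty. The sketch below assumes a single cache level, and the cycle counts in the usage note are made-up values for illustration, not measurements.

```python
def average_access_cycles(
    hit_cycles: float,
    miss_rate: float,
    miss_penalty_cycles: float,
) -> float:
    """Average cost of one memory access with a single-level cache."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles
```

With a 4-cycle hit and a 200-cycle miss penalty, a 2% miss rate gives an average of about 8 cycles per access, and a 10% miss rate gives about 24. Even a small miss rate ends up dominating the average.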
Beyond raw speed
Other performance dimensions:
- Throughput: total work done per unit time. Servers care about this more than latency.
- Latency: time from request to first response. Real-time systems care about this.
- Energy efficiency: performance per watt. Critical for mobile and datacenter.
- Cost-effectiveness: performance per dollar.
Different applications optimize different dimensions. A scientific simulation prioritizes raw FLOPS; a smartphone prioritizes battery life; a database server prioritizes throughput at predictable latencies.
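A small sketch makes the point concrete: the "best" machine changes with the metric. All machine names and figures below are invented for illustration.

```python
from typing import Callable

# Hypothetical machines; every number here is made up for illustration.
MACHINES: dict[str, dict[str, float]] = {
    "hpc_node": {"gflops": 2000.0, "watts": 500.0, "dollars": 20000.0},
    "phone_soc": {"gflops": 50.0, "watts": 5.0, "dollars": 100.0},
    "db_server": {"gflops": 800.0, "watts": 300.0, "dollars": 1000.0},
}


def best_by(metric: Callable[[dict[str, float]], float]) -> str:
    """Return the name of the machine that scores highest on `metric`."""
    return max(MACHINES, key=lambda name: metric(MACHINES[name]))


raw_speed = best_by(lambda m: m["gflops"])                  # raw performance
per_watt = best_by(lambda m: m["gflops"] / m["watts"])      # energy efficiency
per_dollar = best_by(lambda m: m["gflops"] / m["dollars"])  # cost-effectiveness
```

With these invented numbers, each metric picks a different machine: `hpc_node` for raw speed, `phone_soc` for performance per watt, and `db_server` for performance per dollar.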
For the technology-driven scaling, see Moore’s law. For the architectural building blocks, see Cache memory and Instruction execution cycle. For the broader history, see History of computers.