Computers represent characters (letters, digits, punctuation, and control characters) as binary numbers. The foundational encoding is ASCII (American Standard Code for Information Interchange), a 7-bit code that covers 128 distinct characters.

Although ASCII uses only 7 bits, characters are typically stored in 8-bit bytes — the ASCII code occupies the low-order seven bits and the high-order bit is set to 0 (or used for parity in some systems).
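
As a rough illustration of this byte layout, the sketch below stores a 7-bit ASCII code in a byte, either leaving the high bit at 0 or using it as a parity bit. The function names `to_byte` and `with_even_parity` are made up for this example, and even parity is only one of the schemes a system might have used.

```python
def to_byte(ch: str) -> int:
    """Return the 8-bit storage value of an ASCII character (high bit 0)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    return code  # bit 7 is already 0


def with_even_parity(ch: str) -> int:
    """Set bit 7 so that the whole byte contains an even number of 1 bits."""
    code = to_byte(ch)
    parity = bin(code).count("1") & 1  # 1 if the 7-bit code has an odd number of 1s
    return code | (parity << 7)


print(f"{to_byte('A'):08b}")           # 01000001
print(f"{with_even_parity('A'):08b}")  # 01000001 (two 1 bits, parity bit stays 0)
print(f"{with_even_parity('C'):08b}")  # 11000011 (three 1 bits, parity bit set)
```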

Why ASCII

ASCII has two important properties:

  1. Sequential ordering: alphabetic characters (A–Z, a–z) and numeric characters (0–9) are assigned codes in increasing order. Sorting text by treating ASCII codes as unsigned binary numbers therefore gives numerical order for digits and alphabetical order for letters of the same case (every uppercase letter sorts before every lowercase one).

  2. BCD encoding embedded in digit codes: the low-order four bits of each digit character’s ASCII code are exactly the binary-coded decimal representation of that digit. For example, '5' has ASCII code 0x35 (0011 0101): the high nibble is 0011 (the digit-character marker) and the low nibble is 0101 (5 in BCD).

This embedded BCD makes converting between ASCII digit characters and numeric values trivial: keep only the low 4 bits (c & 0x0F) to get the digit's value, and OR the value with 0x30 to get the character back.
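
Both properties can be tried out in a few lines. This is a minimal sketch, not a library API: `digit_value`, `digit_char`, and `parse_decimal` are names invented here for illustration.

```python
def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character: keep the low nibble."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"{ch!r} is not an ASCII digit")
    return code & 0x0F


def digit_char(value: int) -> str:
    """ASCII digit character for a value 0-9: put 0011 in the high nibble."""
    if not 0 <= value <= 9:
        raise ValueError("value must be in 0..9")
    return chr(0x30 | value)


def parse_decimal(text: str) -> int:
    """Convert a string of ASCII digits to an integer, one digit at a time."""
    total = 0
    for ch in text:
        total = total * 10 + digit_value(ch)
    return total


print(digit_value("5"))       # 5
print(digit_char(7))          # 7
print(parse_decimal("2024"))  # 2024

# Sequential ordering: comparing code values gives alphabetical order
# within one case, but all uppercase letters come before lowercase ones.
by_code = lambda word: [ord(c) for c in word]
print(sorted(["pear", "apple", "fig"], key=by_code))  # ['apple', 'fig', 'pear']
print(sorted(["b", "A", "a", "B"], key=by_code))      # ['A', 'B', 'a', 'b']
```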

Standard ASCII range

Code (decimal)  Characters          Purpose
0–31            Control characters  Null, tab, line feed, carriage return, escape, etc.
32              Space               Whitespace
33–47           ! " # ... /         Punctuation
48–57           0–9                 Digit characters
58–64           : ; ... @           More punctuation
65–90           A–Z                 Uppercase letters
91–96           [ \ ] ^ _ `         Punctuation (95 = _, 96 = backtick)
97–122          a–z                 Lowercase letters
123–127         { | } ~ DEL         Punctuation (123–126) and the DEL control character (127)
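
The ranges in this table translate directly into a lookup. The sketch below is only an illustration: the function name `classify` and the category labels are choices made here, not part of any standard.

```python
from collections import Counter


def classify(code: int) -> str:
    """Return the category of a 7-bit ASCII code, following the ranges in the table."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"  # everything left over: 33-47, 58-64, 91-96, 123-126


# Tally how many of the 128 codes fall into each category.
counts = Counter(classify(code) for code in range(128))
print(counts["control"], counts["digit"], counts["uppercase"], counts["punctuation"])
# 33 10 26 32
```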

The gap between the uppercase and lowercase letter codes is exactly 32 (the single bit 0x20), so for letters tolower(c) = c | 0x20 and toupper(c) = c & ~0x20. This is another bit-twiddle that is useful in low-level code, but it is valid only when c is already a letter.
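
A small sketch of the case trick, working on integer codes. The helper names (`is_ascii_letter`, `to_lower`, `to_upper`) are hypothetical; the range check is there because the bit operation on its own also changes non-letters.

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)


def is_ascii_letter(code: int) -> bool:
    return 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A


def to_lower(code: int) -> int:
    """Set the case bit, but only for letters."""
    return code | CASE_BIT if is_ascii_letter(code) else code


def to_upper(code: int) -> int:
    """Clear the case bit, but only for letters."""
    return code & ~CASE_BIT if is_ascii_letter(code) else code


print(chr(to_lower(ord("G"))))  # g
print(chr(to_upper(ord("g"))))  # G

# Without the letter check the trick corrupts other characters:
print(chr(ord("@") | CASE_BIT))   # ` (0x40 becomes 0x60)
print(chr(ord("{") & ~CASE_BIT))  # [ (0x7B becomes 0x5B)
```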

Beyond ASCII

ASCII covers English well but is inadequate for international text. Several extensions and replacements exist:

  • Extended ASCII / ISO 8859: uses the high bit to encode an additional 128 characters (accents, currency symbols, etc.). Many regional variants.
  • Unicode: aims to encode every character of every writing system. The standard defines code points in the range U+0000 to U+10FFFF (1,114,112 values, fitting in 21 bits).
  • UTF-8: variable-length encoding of Unicode that’s backward-compatible with ASCII (ASCII characters use 1 byte, others use 2–4 bytes). The dominant encoding on the web.
  • UTF-16: encodes Unicode in 16-bit code units, with characters above U+FFFF taking two units (a surrogate pair). It is what Java and Windows use internally for strings.
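
The size differences between these encodings are easy to observe with Python's built-in codecs. In this sketch the helper `encoded_sizes` and the sample characters are chosen purely for illustration.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code-unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" omits the byte-order mark, so every 2 bytes is one code unit.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


samples = {
    "letter A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",
}
for name, ch in samples.items():
    utf8_bytes, utf16_units = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {name}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
# ISO 8859-1 (Latin-1) stores the accented letter in one byte with the high bit set.
print("\u00e9".encode("latin-1"))  # b'\xe9'
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8; only the last one, which lies above U+FFFF, needs two UTF-16 code units.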

Most modern software uses UTF-8 by default for text storage and interchange, and pure ASCII text is already valid UTF-8.

Why this matters

Character encoding is invisible when everything works and a nightmare when it doesn’t. Mixing encodings (a UTF-8 file read as Latin-1, for example) produces “mojibake” — characters interpreted by the wrong code page show up as gibberish.
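
The UTF-8-read-as-Latin-1 case can be reproduced directly. This is a minimal sketch; the sample word is arbitrary, and the repair step works only because Latin-1 maps every byte value to some character, so no information is lost on the wrong decode.

```python
original = "caf\u00e9"           # "café"
data = original.encode("utf-8")  # the accented letter becomes two bytes: C3 A9

wrong = data.decode("latin-1")   # each byte is read as its own character
right = data.decode("utf-8")

print(data)   # b'caf\xc3\xa9'
print(wrong)  # cafÃ©
print(right)  # café

# Undo the mistake: re-encode with the wrong codec, then decode with the right one.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```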

For the related concept of how a character’s ASCII bits map to a digit value, see BCD Addition. For the broader context of how memory stores bytes, see Byte addressability and Endianness.