Skip to content

03 Floating Point Systems (IEEE 754)

1. Introduction to the IEEE 754 Standard

The IEEE 754 Standard is the universal convention used by modern computers to represent real numbers (decimals). * The Concept: Unlike integers which use a direct counting method (linear), IEEE 754 uses Binary Scientific Notation (\(1.xxxxx \times 2^y\)). * The Goal: To solve the trade-off between Range (how large/small a number can be) and Precision (how accurate the number is) within a fixed number of bits. * Non-Uniformity: The numbers it can represent are not evenly spaced. They are extremely dense near zero and become progressively sparser as the magnitude increases.

2. Anatomy of IEEE 754 (Storage Structure)

Any floating-point number \(V\) is encoded by breaking it down into three non-continuous components based on the formula: $$ V = (-1)^S \times (1.M) \times 2^{(E - Bias)} $$

A. The Sign Bit (\(S\))

  • Size: 1 bit.
  • Function: Determines positive (0) or negative (1).
  • Independence: The sign is stored separately from the value, allowing for the existence of distinct +0 and -0.

B. The Exponent (\(E\)) and The Bias

  • Size: 8 bits (float) or 11 bits (double).
  • Function: Determines the Order of Magnitude. It controls how far the binary point "floats" left or right.
  • Biased Representation (Shifted Code):
    • The standard uses an Unsigned Integer minus a fixed Bias (127 for float, 1023 for double) to represent the real exponent.
    • \[ RealExponent = StoredExponent - Bias \]
    • Purpose: This allows hardware to compare the magnitude of floating-point numbers using fast, simple unsigned integer comparison logic.

C. The Mantissa (\(M\)) and The Implicit Bit

  • Size: 23 bits (float) or 52 bits (double).
  • Function: Determines the Precision (Significant Figures).
  • Implicit Leading Bit:
    • In binary scientific notation (\(1.xxxxx \times 2^E\)), the leading digit is always 1 (unless the number is zero).
    • Optimization: The computer assumes this "1." exists but does not store it, gaining 1 extra bit of precision for free.

3. Precision and Range Analysis

A. Range Derivation (Max Magnitude)

  • Float: Max effective exponent \(\approx 127\) \(\rightarrow\) Range \(\approx \pm 3.4 \times 10^{38}\).
  • Double: Max effective exponent \(\approx 1023\) \(\rightarrow\) Range \(\approx \pm 1.7 \times 10^{308}\).
  • Boundary: Exceeding this range results in Overflow (\(\pm\)Infinity).

B. The Hole around Zero (The Underflow Gap)

Floating-point numbers cannot represent values arbitrarily close to zero. The valid range is not continuous. * The Gap: There is a "forbidden zone" between 0 and the Smallest Normalized Positive Number (Min). * For float, this Min is approximately \(1.175 \times 10^{-38}\). * The Structure: On the number line, representable numbers look like: [-Max ... -Min] ... (GAP) ... 0 ... (GAP) ... [+Min ... +Max] * Underflow: If you try to store a number smaller than this Min (e.g., \(10^{-50}\)), the exponent cannot go any lower. The value falls into the gap and is forced to 0.

C. Precision Derivation (Significant Figures)

  • Float: 24-bit effective precision \(\approx\) 6-7 decimal digits.
  • Double: 53-bit effective precision \(\approx\) 15-16 decimal digits.
  • Root Cause of Error: Decimal fractions (like 0.1) often become infinite repeating fractions in binary. Since the Mantissa bits are finite, these infinite series are truncated, causing permanent precision loss.

4. Special States and Anomalies

The IEEE 754 standard reserves specific bit patterns to handle boundary cases.

A. Infinity (Inf)

  • Pattern: Exponent = All 1s, Mantissa = All 0s.
  • Origin: Division by zero (1.0/0.0) or Overflow.
  • Behavior: A valid state that propagates (e.g., Inf + 1 = Inf).

B. Not a Number (NaN)

  • Pattern: Exponent = All 1s, Mantissa \(\neq\) 0.
  • Origin: Undefined operations (e.g., sqrt(-1), 0.0/0.0).
  • Behavior: Any operation with NaN yields NaN. NaN == NaN is always False.

C. Signed Zero (-0)

  • Pattern: Sign bit = 1, Exponent = 0, Mantissa = 0.
  • Behavior: Mathematically equal to +0 in comparison, but mathematically distinct in limiting operations (e.g., 1.0 / -0.0 = -Inf).

5. Engineering Traps and Best Practices

A. The Comparison Trap

  • Problem: Due to truncation errors, calculated floats rarely match bit-for-bit.
  • Rule: Never use == to compare floats.
  • Solution: Check if the difference is within an Epsilon (small margin of error). c if (fabs(a - b) < 1e-9) { /* Treated as equal */ }

B. The I/O Trap (Scanf vs Printf)

  • Input (scanf): Requires strict type matching.
    • float \(\rightarrow\) %f.
    • double \(\rightarrow\) %lf. (Using %f here corrupts memory).
  • Output (printf): float is automatically promoted to double; %f works for both.

C. The Default Type Rule

  • Literal Constants: Raw decimals (e.g., 3.14) are implicitly double.
  • Optimization: Use the f suffix (e.g., 3.14f) to define a float literal explicitly, preventing unnecessary double-to-float conversion instructions.